Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/02/10 16:40:59 UTC

[GitHub] [lucenenet] StanislavPrusac opened a new issue #618: One character is missing in class ASCIIFoldingFilter

StanislavPrusac opened a new issue #618:
URL: https://github.com/apache/lucenenet/issues/618


   I think one character is missing from the ASCIIFoldingFilter class:
   Character: Ʀ
   Code point (decimal): 422
   UTF-16: 01A6
   
   Source code that might need to be added to method 
   FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length):
   
   ```c#
   case '\u01A6': // Ʀ  [LATIN LETTER YR]
       output[outputPos++] = 'R';
       break;
   ```
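   
   A quick way to confirm the gap (a minimal sketch; it assumes the public static `ASCIIFoldingFilter.FoldToASCII` overload named above can be called directly):
   
   ```c#
   using Lucene.Net.Analysis.Miscellaneous;
   
   char[] input = "\u01A6".ToCharArray();       // Ʀ
   char[] output = new char[input.Length * 4];  // folding can expand a char up to 4x
   int outputLen = ASCIIFoldingFilter.FoldToASCII(input, 0, output, 0, input.Length);
   // Currently outputLen == 1 and output[0] == '\u01A6': the character
   // passes through unfolded instead of being mapped to 'R'.
   ```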
   
   Links about this character: 
   [https://codepoints.net/U+01A6](https://codepoints.net/U+01A6)  
   [https://en.wikipedia.org/wiki/%C6%A6](https://en.wikipedia.org/wiki/%C6%A6)
   
   








[GitHub] [lucenenet] NightOwl888 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057713660


   @diegolaz79 
   
   Nope, it isn't valid to use multiple tokenizers in the same Analyzer, as there are [strict consuming rules to adhere to](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/core/Lucene.Net.Analysis.TokenStream.html).
   
   It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing, such as the [rule that ensures `TokenStream` classes are sealed or use a sealed `IncrementToken()` method](https://github.com/apache/lucenenet/tree/e75b86ecc5ed8219100ec7a0b472e7a4454ec013/src/dotnet/Lucene.Net.CodeAnalysis.CSharp) (contributions welcome). It is not likely we will add any additional code analyzers prior to the 4.8.0 release unless they are contributed by the community, though, as these are not blocking the release. For the time being, the best way to ensure custom analyzers adhere to the rules is to test them with [Lucene.Net.TestFramework](https://www.nuget.org/packages/Lucene.Net.TestFramework), which also hits them with multiple threads, random cultures, and random strings of text to ensure they are robust.
   
   I built a demo showing how to set up testing on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a `WhitespaceTokenizer` and `ICUFoldingFilter`. Of course, you may wish to add additional test conditions to ensure your custom analyzer meets your expectations, and then you can experiment with different tokenizers and adding or rearranging filters until you find a solution that meets all of your requirements (as well as plays by Lucene's rules). You can then add additional conditions later as you discover issues.
   
   ```c#
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Core;
   using Lucene.Net.Analysis.Icu;
   using Lucene.Net.Util;
   using System.IO;
   
   namespace LuceneExtensions
   {
       public sealed class CustomAnalyzer : Analyzer
       {
           private readonly LuceneVersion matchVersion;
   
           public CustomAnalyzer(LuceneVersion matchVersion)
           {
               this.matchVersion = matchVersion;
           }
   
           protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
           {
               // Tokenize...
               Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);
               TokenStream result = tokenizer;
   
               // Filter...
               result = new ICUFoldingFilter(result);
   
               // Return result...
               return new TokenStreamComponents(tokenizer, result);
           }
       }
   }
   ```
   
   ```c#
   using Lucene.Net.Analysis;
   using NUnit.Framework;
   
   namespace LuceneExtensions.Tests
   {
       public class TestCustomAnalyzer : BaseTokenStreamTestCase
       {
           [Test]
           public virtual void TestRemoveAccents()
           {
               Analyzer a = new CustomAnalyzer(TEST_VERSION_CURRENT);
   
               // removal of latin accents (composed)
               AssertAnalyzesTo(a, "résumé", new string[] { "resume" });
   
               // removal of latin accents (decomposed)
               AssertAnalyzesTo(a, "re\u0301sume\u0301", new string[] { "resume" });
   
               // removal of latin accents (multi-word)
               AssertAnalyzesTo(a, "Carlos Pírez", new string[] { "carlos", "pirez" });
           }
       }
   }
   ```
   
   For other ideas about what test conditions you may use, I suggest having a look at Lucene.Net's [extensive analyzer tests](https://github.com/apache/lucenenet/tree/e75b86ecc5ed8219100ec7a0b472e7a4454ec013/src/Lucene.Net.Tests.Analysis.Common/Analysis) including the [ICU tests](https://github.com/apache/lucenenet/tree/e75b86ecc5ed8219100ec7a0b472e7a4454ec013/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu). You may also refer to the tests to see if you can find a similar use case to yours for building queries (although do note that the tests don't show .NET best practices for disposing objects).
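   
   On the disposal point, a minimal sketch (not from the demo; the helper name and index path are hypothetical) of wrapping the `CustomAnalyzer` above in deterministic `using` blocks for production code:
   
   ```c#
   using Lucene.Net.Analysis;
   using Lucene.Net.Index;
   using Lucene.Net.Store;
   using Lucene.Net.Util;
   using LuceneExtensions;
   
   public static class IndexBuilder
   {
       public static void BuildIndex(string indexPath)
       {
           // Unlike the test fixtures, production code should dispose
           // Lucene.NET objects deterministically.
           using (var dir = FSDirectory.Open(indexPath))
           using (Analyzer analyzer = new CustomAnalyzer(LuceneVersion.LUCENE_48))
           using (var writer = new IndexWriter(dir,
               new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
           {
               // ... add documents via writer.AddDocument(...) ...
               writer.Commit();
           }
       }
   }
   ```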





[GitHub] [lucenenet] diegolaz79 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057114726


   @NightOwl888 where do you use that GetTokenStream?
   I tried creating this custom analyzer:
   
   ```c#
   public class CustomAnalyzer : Analyzer
   {
       LuceneVersion matchVersion;
   
       public CustomAnalyzer(LuceneVersion p_matchVersion) : base()
       {
           matchVersion = p_matchVersion;
       }
   
       protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
       {
           Tokenizer tokenizer = new KeywordTokenizer(reader);
           TokenStream result = new StopFilter(matchVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
           result = new LowerCaseFilter(matchVersion, result);
           result = new StandardFilter(matchVersion, result);
           result = new ASCIIFoldingFilter(result);
           return new TokenStreamComponents(tokenizer, result);
       }
   }
   ```
   
   but with no luck: even after recreating the index and searching with that CustomAnalyzer, I still can't search while ignoring the accents. I'm trying to search for example "perez" and find "pérez" too.
   As LuceneVersion I'm using **LuceneVersion.LUCENE_48**
   
   Thanks!





[GitHub] [lucenenet] NightOwl888 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1035939483


   Thanks for the report.
   
   As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the [ASCIIFoldingFilter](https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java) in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, `Lucene.Net.Analysis.Common` all came from 4.8.1), the change you are suggesting isn't even reflected in the [ASCIIFoldingFilter in the latest commit](https://github.com/apache/lucene/blob/8ac26737913d0c1555019e93bc6bf7db1ab9047e/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java#L1154-L1173).
   
   If you wish to pursue adding more characters to `ASCIIFoldingFilter`, I suggest you take it up with the Lucene design team on their [dev mailing list](https://lucene.apache.org/core/discussion.html).
   
   However, do note this isn't the only filter included in the box that is capable of folding accented characters to their ASCII equivalents. Some alternatives:
   
   1. [ICUNormalizer2Filter](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/icu/Lucene.Net.Analysis.Icu.ICUNormalizer2Filter.html)
   2. [ICUFoldingFilter](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/icu/Lucene.Net.Analysis.Icu.ICUFoldingFilter.html)
   
   Note that you can also create a custom folding filter by using a similar approach in the [ICUFoldingFilter implementation](https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00015/src/Lucene.Net.Analysis.ICU/Analysis/Icu/ICUFoldingFilter.cs/#L66) (ported from Lucene 7.1.0). There is a [tool you can port](https://github.com/apache/lucene/blob/releases/lucene-solr/7.1.0/lucene/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/GenerateUTR30DataFiles.java) to generate a `.nrm` binary file from modified versions of [these text files](https://github.com/apache/lucene/tree/releases/lucene-solr/7.1.0/lucene/analysis/icu/src/data/utr30). The `.nrm` file can then be provided to the constructor of `ICU4N.Text.Normalizer2` - more about the data format can be found in the [ICU normalization docs](https://unicode-org.github.io/icu/userguide/transforms/normalization/). Note that the `.nrm` file is the same binary format used in C++ and Java.
   
   Alternatively, if you wish to extend the `ASCIIFoldingFilter` with your own custom brew of characters, you can simply chain your own filter to `ASCIIFoldingFilter` as pointed out in [this article](https://www.extutorial.com/en/share/1404275).
   
   ```c#
   public TokenStream GetTokenStream(string fieldName, TextReader reader)
   {
       TokenStream result = new StandardTokenizer(reader);
       result = new StandardFilter(result);
       result = new LowerCaseFilter(result);
       // etc etc ...
       result = new StopFilter(result, yourSetOfStopWords);
       result = new MyCustomFoldingFilter(result);
       result = new ASCIIFoldingFilter(result);
       return result;
   }
   ```
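   
   `MyCustomFoldingFilter` above is just a placeholder; a minimal sketch of what it might look like against the Lucene.NET 4.8 `TokenFilter` API (sealed, per the code-analysis rule mentioned above in this thread; the U+01A6 mapping mirrors the original report):
   
   ```c#
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.TokenAttributes;
   
   public sealed class MyCustomFoldingFilter : TokenFilter
   {
       private readonly ICharTermAttribute termAtt;
   
       public MyCustomFoldingFilter(TokenStream input)
           : base(input)
       {
           termAtt = AddAttribute<ICharTermAttribute>();
       }
   
       public override bool IncrementToken()
       {
           if (!m_input.IncrementToken())
               return false;
   
           // Fold characters that ASCIIFoldingFilter doesn't cover, in place.
           char[] buffer = termAtt.Buffer;
           for (int i = 0; i < termAtt.Length; i++)
           {
               if (buffer[i] == '\u01A6') buffer[i] = 'R'; // Ʀ  [LATIN LETTER YR]
           }
           return true;
       }
   }
   ```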
   
   





[GitHub] [lucenenet] diegolaz79 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1059131895


   Thanks! Just removed the LowerCaseFilter and swapped the StandardFilter for the StopFilter, and it's working fine with casing and diacritics searches.
   Still need to adjust the stop words for something more suitable for Spanish, but it's working well like this.
   
   





[GitHub] [lucenenet] diegolaz79 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1060995890


   Thanks again! Your suggestions helped me a lot! 
   
   I'm currently doing it like this:
   
   ```c#
   IDictionary<string, Analyzer> myAnalyzerPerField = new Dictionary<string, Analyzer>();
   myAnalyzerPerField["code"] = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
   finalAnalyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(LuceneVersion.LUCENE_48), myAnalyzerPerField);
   ```
   The WhitespaceAnalyzer did not help with my code-format field ("M-12-14", "B-10-39", etc.), but I'll try another more suitable one.
   
   And I'm using the finalAnalyzer for indexing and search.











[GitHub] [lucenenet] diegolaz79 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1059295238


   One thing I noticed: there is a field that has a format like "M-4-20" or "B-7-68" ...
   `new StringField("code", code, Field.Store.YES)`
   but when searching that field with the above analyzer, it can't match the dashes
   ```c#
   searchTerm = "*" + searchTerm + "*";
   Query q = new WildcardQuery(new Term(field, searchTerm));
   ```
   is there a way to escape the dash from the term or skip analysis for that field?
   thanks!





[GitHub] [lucenenet] diegolaz79 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057333375


   Thanks @NightOwl888 for the quick response! So I added the necessary ICU package in VS and changed the analyzer to:
   ```c#
   protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
   {
       Tokenizer tokenizer = new KeywordTokenizer(reader);
       //TokenStream result = new StopFilter(_matchVersion, tokenizer, m_stopwords);
       TokenStream result = new LowerCaseFilter(_matchVersion, tokenizer);
       result = new ICUFoldingFilter(result);
       result = new StandardFilter(_matchVersion, result);
       result = new ASCIIFoldingFilter(result);
       return new TokenStreamComponents(tokenizer, result);
   }
   ```
   But it's still not matching case or accent variations. Am I missing something?
   
   For search I'm building this query:
   ```c#
   const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
   var dir = FSDirectory.Open(HostingEnvironment.ApplicationPhysicalPath + CurrentIndexPath);
   //var analyzer = new StandardAnalyzer(AppLuceneVersion);
   var analyzer = new CustomAnalyzer(AppLuceneVersion);
   var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
   IndexWriter writer = new IndexWriter(dir, indexConfig);
   var reader = writer.GetReader(applyAllDeletes: true);
   var searcher = new IndexSearcher(reader);
   BooleanQuery booleanQuery = new BooleanQuery(true);
   var terms = searchTerms.Split(' ');
   foreach (string s in terms)
   {
       BooleanQuery subQuery = new BooleanQuery(true);
       string searchTerm = s; //.ToLower();
       if (!searchTerm.EndsWith("*"))
       {
           searchTerm = searchTerm.Trim() + "*";
       }
       if (!searchTerm.StartsWith("*"))
       {
           searchTerm = "*" + searchTerm.Trim();
       }
       foreach (string field in searchFieldsApplicants) // the list of fields indexed, stored, and part of the saved document
       {
           Query q = new WildcardQuery(new Term(field, searchTerm));
           subQuery.Add(q, Occur.SHOULD);
       }
       booleanQuery.Add(subQuery, Occur.MUST);
   }
   
   var topResults = searcher.Search(booleanQuery, numberOfResults).ScoreDocs;
   ```
   
   For example, in the original database and saved to the index I had "Carlos Pírez".
   If I search "carlos Pírez" (with a lowercase c) or "Carlos Pirez" (without the accent) I get no results... The case situation I could solve by lowercasing while indexing and then title-casing for display, but I'm not sure how to solve the accent issue.
   
   It seems like I'm missing something to **_force_** those filters added to the analyzer, right?
   
   Thanks
   





[GitHub] [lucenenet] diegolaz79 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1061766386


   Ok, still using WhitespaceAnalyzer for those special columns; the problem was I was lowercasing the search term for the CustomAnalyzer. So for those columns I actually uppercase it, since I know the column only holds uppercase characters and the analyzer doesn't lowercase it.
   Thanks for the PerFieldAnalyzerWrapper pointer!





[GitHub] [lucenenet] NightOwl888 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1075686544


   This seems to have been resolved now.





[GitHub] [lucenenet] NightOwl888 closed issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 closed issue #618:
URL: https://github.com/apache/lucenenet/issues/618


   





[GitHub] [lucenenet] diegolaz79 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1058654152


   Thanks again!!
   My current version is working fine with your suggestions:
   ```c#
   protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
   {
       StandardTokenizer standardTokenizer = new StandardTokenizer(_matchVersion, reader);
       TokenStream stream = new StandardFilter(_matchVersion, standardTokenizer);
       stream = new LowerCaseFilter(_matchVersion, stream);
       stream = new ICUFoldingFilter(stream);
       return new TokenStreamComponents(standardTokenizer, stream);
   }
   ```
   
   I just lowercase the text the user enters for searching, and I find all combinations of accent and case.
   THANKS!!





[GitHub] [lucenenet] NightOwl888 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057158478


   @diegolaz79 
   
   My bad. It looks like the example I pulled was from an older version of Lucene. However, ["Answer 2" in this link](https://www.extutorial.com/en/share/1404275) shows an example from 4.9.0, which is similar enough to 4.8.0.
   
   ```c#
   // Accent insensitive analyzer
   public class CustomAnalyzer : StopwordAnalyzerBase
   {
       public CustomAnalyzer(LuceneVersion matchVersion)
           : base(matchVersion, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
       {
       }
   
       protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
       {
           Tokenizer tokenizer = new KeywordTokenizer(reader);
           TokenStream result = new StopFilter(m_matchVersion, tokenizer, m_stopwords);
           result = new LowerCaseFilter(m_matchVersion, result);
           result = new CustomFoldingFilter(result);
           result = new StandardFilter(m_matchVersion, result);
           result = new ASCIIFoldingFilter(result);
           return new TokenStreamComponents(tokenizer, result);
       }
   }
   ```
   
   And of course, the whole idea of the last example is to implement another folding filter named `CustomFoldingFilter`, similar to `ASCIIFoldingFilter`, that adds your own folding rules and is executed before `ASCIIFoldingFilter`.
   
   Alternatively, use `ICUFoldingFilter`, which implements [UTR #30](http://www.unicode.org/reports/tr30/tr30-4.html) (includes accent removal).








[GitHub] [lucenenet] diegolaz79 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
diegolaz79 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057370754


   Ok, some progress here,
   
   ```
   protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
           {
               Tokenizer tokenizer = new KeywordTokenizer(reader);
               TokenStream result = new StandardTokenizer(_matchVersion, reader);
               result = new LowerCaseFilter(_matchVersion, result);
               result = new ICUFoldingFilter(result);
               result = new StandardFilter(_matchVersion, result);
               result = new ASCIIFoldingFilter(result);
               return new TokenStreamComponents(tokenizer, result);
           }
   ```
   
   Changed the analyzer so the TokenStream result starts from the StandardTokenizer instead of the StopFilter, and it's filtering at least (the previous version in my earlier post wasn't even filtering).





[GitHub] [lucenenet] NightOwl888 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1057744203


   FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo





[GitHub] [lucenenet] NightOwl888 commented on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1059399079


   > Thanks! Just removed the LowerCaseFilter and swapped the StandardFilter for the StopFilter, and it's working fine with casing and diacritics searches. Still need to adjust the stop words for something more suitable for Spanish, but it's working well like this.
   
   FYI - There is a [generic Spanish stop word list](https://github.com/apache/lucenenet/blob/50dd72e23dcd0f2bad639ed164bc4a308cb9500c/src/Lucene.Net.Analysis.Common/Analysis/Snowball/spanish_stop.txt) that can be accessed through [SpanishAnalyzer.DefaultStopSet](https://github.com/apache/lucenenet/blob/50dd72e23dcd0f2bad639ed164bc4a308cb9500c/src/Lucene.Net.Analysis.Common/Analysis/Es/SpanishAnalyzer.cs#L54).
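   
   A minimal sketch of wiring that stop set into an accent-folding analyzer (the class name is hypothetical; the chain mirrors SpanishAnalyzer's own order of lowercasing before stop filtering):
   
   ```c#
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Core;
   using Lucene.Net.Analysis.Es;
   using Lucene.Net.Analysis.Icu;
   using Lucene.Net.Analysis.Standard;
   using Lucene.Net.Util;
   using System.IO;
   
   public sealed class SpanishFoldingAnalyzer : Analyzer
   {
       private readonly LuceneVersion matchVersion;
   
       public SpanishFoldingAnalyzer(LuceneVersion matchVersion)
       {
           this.matchVersion = matchVersion;
       }
   
       protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
       {
           Tokenizer tokenizer = new StandardTokenizer(matchVersion, reader);
           // Lowercase first so the (lowercase) Spanish stop words match,
           // remove stop words while tokens still carry their accents,
           // then fold case and diacritics for accent-insensitive matching.
           TokenStream result = new LowerCaseFilter(matchVersion, tokenizer);
           result = new StopFilter(matchVersion, result, SpanishAnalyzer.DefaultStopSet);
           result = new ICUFoldingFilter(result);
           return new TokenStreamComponents(tokenizer, result);
       }
   }
   ```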
   
   > One thing I noticed: there is a field that has a format like "M-4-20" or "B-7-68" ... `new StringField("code", code, Field.Store.YES)` but when searching that field with the above analyzer, it can't match the dashes
   > 
   > ```
   > searchTerm = "*" + searchTerm + "*";
   > Query q = new WildcardQuery(new Term(field, searchTerm));
   > ```
   > 
   > is there a way to escape the dash from the term or skip analysis from that field? thanks!
   
   [`PerFieldAnalyzerWrapper`](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/analysis-common/Lucene.Net.Analysis.Miscellaneous.PerFieldAnalyzerWrapper.html) applies a different analyzer to each field ([example](https://github.com/NightOwl888/LuceneNetDemo/blob/0b50b3546187212783a1576b7b35df93f087af85/LuceneNetDemo/GitHubIndex.cs#L58-L88)). Note you don't necessarily have to use inline analyzers; you can also simply new up pre-constructed analyzers for each field.
   
   If all of the data in the field can be considered a token, there is a [`KeywordAnalyzer`](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/analysis-common/Lucene.Net.Analysis.Core.KeywordAnalyzer.html) that can be used to keep the entire field together.
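   
   For example, a minimal sketch (not from the linked demo) that keeps "code" values like "M-12-14" whole while everything else goes through the custom accent-folding analyzer:
   
   ```c#
   using System.Collections.Generic;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Core;
   using Lucene.Net.Analysis.Miscellaneous;
   using Lucene.Net.Util;
   
   var perField = new Dictionary<string, Analyzer>
   {
       // KeywordAnalyzer emits the whole field value as a single token,
       // so "M-12-14" stays intact, dashes and all.
       ["code"] = new KeywordAnalyzer()
   };
   Analyzer analyzer = new PerFieldAnalyzerWrapper(
       new CustomAnalyzer(LuceneVersion.LUCENE_48), perField);
   ```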








[GitHub] [lucenenet] NightOwl888 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

Posted by GitBox <gi...@apache.org>.
NightOwl888 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1058810622


   Just out of curiosity, do all of your use cases work without the `LowerCaseFilter`?
   
   Lowercasing is not the same as case folding (which is what `ICUFoldingFilter` does):
   
   - *Lowercasing:* Converts the entire string from uppercase to lowercase _in the invariant culture_.
   - *Case folding:* Folds the case while handling international special cases such as the [infamous Turkish uppercase dotted i](http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html) and the German "ß" (among others).
   
   ```c#
   AssertAnalyzesTo(a, "Fuß", new string[] { "fuss" }); // German
   
   AssertAnalyzesTo(a, "QUİT", new string[] { "quit" }); // Turkish
   ```
   
   [Case Mapping and Case Folding](https://www.w3.org/TR/charmod-norm/#definitionCaseFolding)
   
   While this might not matter for your use case, it is also worth noting that performance will be improved without the `LowerCaseFilter`.
   
   In addition, search performance and accuracy can be improved by using a `StopFilter` with a reasonable stop word set to cover your use cases - the only reason I removed it from the demo was because the question was about removing diacritics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


