Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2020/04/28 19:28:49 UTC

[GitHub] [lucenenet] NightOwl888 commented on issue #246: Custom StopWord Analyzer - Exception Cannot read from a closed TextReader.

NightOwl888 commented on issue #246:
URL: https://github.com/apache/lucenenet/issues/246#issuecomment-620808822


   `CreateComponents()` is a factory method (a creational operation), so only short-lived dependencies should be disposed inside it. Because you dispose the stream before returning it, the caller of `CreateComponents()` receives it in a state it cannot use.
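   
   The same ownership rule can be shown without Lucene.NET at all. The sketch below is only an illustration, not the code from this issue: `FactoryDisposalSketch`, `CreateBroken` and `Create` are hypothetical names, and a plain `StringReader` stands in for the stream the analyzer hands back.
   
   ```c#
   using System;
   using System.IO;
   
   public static class FactoryDisposalSketch
   {
       // Anti-pattern: the using block disposes the reader as the method returns,
       // so the caller receives an object that is already closed.
       public static TextReader CreateBroken(string text)
       {
           using (var reader = new StringReader(text))
           {
               return reader;
           }
       }
   
       // Factory that transfers ownership: the caller disposes the reader when done.
       public static TextReader Create(string text)
       {
           return new StringReader(text);
       }
   
       public static void Main()
       {
           try
           {
               CreateBroken("hello").ReadToEnd();
           }
           catch (ObjectDisposedException ex)
           {
               // Reading from the disposed reader fails.
               Console.WriteLine("Broken factory: " + ex.GetType().Name);
           }
   
           using (TextReader reader = Create("hello"))
           {
               Console.WriteLine("Working factory: " + reader.ReadToEnd());
           }
       }
   }
   ```
   
   In the working variant the object outlives the factory call and is disposed by whoever consumes it, which is the arrangement `CreateComponents()` callers rely on.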
   
   To make a customized standard analyzer, the best approach would be to model your new class after the [existing StandardAnalyzer class](https://github.com/apache/lucenenet/blob/8cf15f7fd0bb7b22bb2e865895998583d049ab92/src/Lucene.Net.Analysis.Common/Analysis/Standard/StandardAnalyzer.cs).
   
   ```c#
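       using System.Collections.Generic;
       using System.IO;
       using Lucene.Net.Analysis;
       using Lucene.Net.Analysis.Core;
       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Analysis.Util;
       using Lucene.Net.Util;

       public sealed class MyStopwordAnalyzer : StopwordAnalyzerBase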
       {
           /// <summary>
           /// An unmodifiable set containing some common English words that are usually not
           /// useful for searching. 
           /// </summary>
           public static readonly CharArraySet STOP_WORDS_SET = LoadEnglishStopWordsSet();
   
           private static CharArraySet LoadEnglishStopWordsSet() // LUCENENET: Avoid static constructors (see https://github.com/apache/lucenenet/pull/224#issuecomment-469284006)
           {
               IList<string> stopWords = new string[] { "a", "an", "and", "are", "as", "at", "be",
                   "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on",
                   "or", "such", "that", "the", "their", "then", "there", "these", "they", "this",
                   "to", "was", "will", "with" };
   #pragma warning disable 612, 618
               var stopSet = new CharArraySet(LuceneVersion.LUCENE_CURRENT, stopWords, false);
   #pragma warning restore 612, 618
               return CharArraySet.UnmodifiableSet(stopSet);
           }
   
           /// <summary>
           /// Builds an analyzer with the given stop words. </summary>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           /// <param name="stopWords"> stop words  </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion, CharArraySet stopWords)
               : base(matchVersion, stopWords)
           {
           }
   
           /// <summary>
           /// Builds an analyzer with the default stop words (<see cref="STOP_WORDS_SET"/>). </summary>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion)
               : this(matchVersion, STOP_WORDS_SET)
           {
           }
   
           /// <summary>
           /// Builds an analyzer with the stop words from the given reader. </summary>
           /// <seealso cref="WordlistLoader.GetWordSet(TextReader, LuceneVersion)"/>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           /// <param name="stopwords"> <see cref="TextReader"/> to read stop words from  </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion, TextReader stopwords)
               : this(matchVersion, LoadStopwordSet(stopwords, matchVersion))
           {
           }
   
           protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
           {
               var src = new StandardTokenizer(m_matchVersion, reader);
               TokenStream tok = new StandardFilter(m_matchVersion, src);
               // tok = new LowerCaseFilter(m_matchVersion, tok); // optional
               tok = new StopFilter(m_matchVersion, tok, m_stopwords);
               return new TokenStreamComponents(src, tok);
           }
       }
   ```
   
   Note that the existing `StandardAnalyzer` class also accepts a `CharArraySet` of stopwords, so it may already meet your needs if you want the `LowerCaseFilter` applied to normalize your text.
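   
   A minimal sketch of that alternative, assuming Lucene.NET 4.8 with the `Lucene.Net.Analysis.Common` package referenced. The class name, field name, sample text and stopword list are made up for illustration.
   
   ```c#
   using System;
   using System.IO;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Standard;
   using Lucene.Net.Analysis.TokenAttributes;
   using Lucene.Net.Analysis.Util;
   using Lucene.Net.Util;
   
   public static class StandardAnalyzerStopwordsSketch
   {
       public static void Main()
       {
           const LuceneVersion version = LuceneVersion.LUCENE_48;
   
           // Hypothetical custom stopword list; 'true' makes lookups ignore case.
           var stopWords = new CharArraySet(version, new[] { "the", "and", "of" }, true);
   
           // The caller owns both the analyzer and the token stream and disposes them.
           using (Analyzer analyzer = new StandardAnalyzer(version, stopWords))
           using (TokenStream stream = analyzer.GetTokenStream("body", new StringReader("The Art of Search")))
           {
               ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
               stream.Reset();
               while (stream.IncrementToken())
               {
                   // Each remaining token, after lowercasing and stopword removal.
                   Console.WriteLine(term.ToString());
               }
               stream.End();
           }
       }
   }
   ```
   
   This avoids writing a custom analyzer at all, at the cost of always getting the lowercasing step that `MyStopwordAnalyzer` above leaves optional.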
   
   

