Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2020/10/14 21:43:06 UTC

[GitHub] [lucenenet] willson556 commented on issue #296: IndexOutOfRangeException when searching

willson556 commented on issue #296:
URL: https://github.com/apache/lucenenet/issues/296#issuecomment-708676417


   I can reliably reproduce this with one of my datasets, but I'm not sure I could write a failing test for it. I'm running on .NET Core/x64.
   
   The stack trace is similar to the ones posted by everyone after the OP:
   ```
      at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32) 
      at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify) 
      at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance) 
      at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm) 
      at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init) 
      at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions) 
      at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts) 
      at Lucene.Net.Search.MultiTermQuery.RewriteMethod.GetTermsEnum(MultiTermQuery query, Terms terms, AttributeSource atts) 
      at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector) 
      at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query) 
      at Lucene.Net.Search.MultiTermQuery.Rewrite(IndexReader reader) 
      at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader) 
      at Lucene.Net.Search.IndexSearcher.Rewrite(Query original) 
      at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query) 
      at Lucene.Net.Search.IndexSearcher.Search(Query query, Filter filter, Int32 n) 
      at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n) 
   ```
   I'm using this analyzer (I'm still getting up to speed with Lucene, so I'm not sure the arrangement of filters actually makes sense):
   ```c#
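   using System.IO;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.CharFilters;
   using Lucene.Net.Analysis.Core;
   using Lucene.Net.Analysis.En;
   using Lucene.Net.Analysis.NGram;
   using Lucene.Net.Analysis.Standard;
   using Lucene.Net.Util;

   public class NGramAnalyzer : Analyzer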
   {
       private readonly LuceneVersion version;
       private readonly int minGram;
       private readonly int maxGram;
   
       public NGramAnalyzer(LuceneVersion version, int minGram = 2, int maxGram = 8)
       {
           this.version = version;
           this.minGram = minGram;
           this.maxGram = maxGram;
       }
   
       /// <inheritdoc />
       protected override TextReader InitReader(string fieldName, TextReader reader)
       {
           var charMap = new NormalizeCharMap.Builder();
           charMap.Add("_", " ");
           return new MappingCharFilter(charMap.Build(), reader);
       }
   
       /// <inheritdoc />
       protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
       {
           // Splits words at punctuation characters, removing punctuation.
           // Splits words at hyphens, unless there's a number in the token...
           // Recognizes email addresses and internet hostnames as one token.
           var tokenizer = new StandardTokenizer(version, reader);
   
           TokenStream filter = new StandardFilter(version, tokenizer);
   
           // Normalizes token text to lower case.
           filter = new LowerCaseFilter(version, filter);
   
           // Removes stop words from a token stream.
           filter = new StopFilter(version, filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
   
           filter = new EnglishMinimalStemFilter(filter);
   
           filter = new NGramTokenFilter(version, filter, minGram, maxGram);
           return new TokenStreamComponents(tokenizer, filter);
       }
   }
   ```
   
   Setup is then:
   
   ```c#
   var indexStore = new RAMDirectory();
   var indexConfig = new IndexWriterConfig(Version, Analyzer);
   indexWriter = new IndexWriter(indexStore, indexConfig);
   initialIndexingTask = Task.Run(() =>
   {
       var stopwatch = Stopwatch.StartNew();
       indexWriter.AddDocuments(collection.Select(GetAndSubscribeToDocument));
       indexWriter.Commit();
       Debug.WriteLine(@$"{typeof(TDocument)} Indexing: {stopwatch.ElapsedMilliseconds}ms");
   });
   ```
   
   Once the initial indexing is complete, searching is done with:
   
   ```c#
   using var reader = DirectoryReader.Open(indexWriter.Directory);
   var searcher = new IndexSearcher(reader);
   
   Query? parsedQuery;
   try
   {
       var queryParser = new MultiFieldQueryParser(Version, DefaultSearchFields, Analyzer);
       var terms = new HashSet<Term>();
       queryParser.Parse(query).Rewrite(reader).ExtractTerms(terms);
   
       var boolQuery = new BooleanQuery();
       terms.ForEach(t =>
       {
           boolQuery.Add(new FuzzyQuery(t), Occur.SHOULD);
           boolQuery.Add(new WildcardQuery(t), Occur.SHOULD);
       });
   
       parsedQuery = boolQuery;
   }
   catch (Exception)
   {
       // TODO: User feedback
       return new (TDocument doc, float score)[0];
   }
   
   var hits = searcher.Search(parsedQuery, resultLimit);
   ```
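
   Since the stack trace ends inside `FuzzyTermsEnum`, one way to narrow this down might be to probe each extracted term on its own and record which ones throw. The helper below is only a sketch: `FailureIsolation` and `FindFailing` are hypothetical names, not part of Lucene.NET, and the probe passed in is whatever call is suspected of failing.

   ```c#
   using System;
   using System.Collections.Generic;

   // Hypothetical diagnostic helper; not part of Lucene.NET.
   public static class FailureIsolation
   {
       // Runs `probe` once per item and collects every item for which it
       // throws IndexOutOfRangeException, together with the exception.
       // Any other exception type is left to propagate.
       public static List<(T Item, IndexOutOfRangeException Error)> FindFailing<T>(
           IEnumerable<T> items, Action<T> probe)
       {
           var failures = new List<(T Item, IndexOutOfRangeException Error)>();
           foreach (var item in items)
           {
               try
               {
                   probe(item);
               }
               catch (IndexOutOfRangeException ex)
               {
                   failures.Add((item, ex));
               }
           }
           return failures;
       }
   }
   ```

   With the variables from the search code above, it could be called as `FailureIsolation.FindFailing(terms, t => searcher.Search(new FuzzyQuery(t), 1))`. If a single term fails every time, that term plus a tiny index might be enough for a standalone failing test.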
   
   I've archived the dataset and code so that I can hopefully go back and gather more data to help troubleshoot. It's worth noting that in my current repro case, I have 4 separate instances of this setup (RAMDirectory, IndexWriter, and Reader+Searcher) all running at the same time, with _nearly_ identical datasets. A quick look through the code up and down the stack trace didn't show me anything in Lucene that was obviously shared between those instances and could be the culprit.
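
   A possible way to test whether the concurrent instances matter is to route every search through one process-wide lock and see whether the exception still occurs. This is a diagnostic sketch only: `SearchGate` is a hypothetical name, and it serializes only the calls passed through it, so background indexing in the other instances would need the same gate to be ruled out as well.

   ```c#
   using System;

   // Hypothetical diagnostic helper; not part of Lucene.NET.
   public static class SearchGate
   {
       // A single lock shared by all index instances in the process.
       private static readonly object Gate = new object();

       // Runs `search` while holding the shared lock and returns its result,
       // so that no two gated calls execute at the same time.
       public static TResult Run<TResult>(Func<TResult> search)
       {
           lock (Gate)
           {
               return search();
           }
       }
   }
   ```

   The last line of the search code would then become `var hits = SearchGate.Run(() => searcher.Search(parsedQuery, resultLimit));`. If the exception persists with all four instances gated, a race between instances looks less likely and the input data becomes the more probable cause.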


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org