Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/04 12:03:21 UTC

[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #427: LUCENE-10220: Add an utility method to get IntervalSource from analyzed text (or token stream)

sonatype-lift[bot] commented on a change in pull request #427:
URL: https://github.com/apache/lucene/pull/427#discussion_r742771217



##########
File path: lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java
##########
@@ -429,4 +444,300 @@ public static IntervalsSource after(IntervalsSource source, IntervalsSource refe
         source,
         Intervals.extend(new OffsetIntervalsSource(reference, false), 0, Integer.MAX_VALUE));
   }
+
+  /**
+   * Returns intervals that correspond to tokens from a {@link TokenStream} returned for {@code
+   * text} by applying the provided {@link Analyzer} as if {@code text} was the content of the given
+   * {@code field}. The intervals can be ordered or unordered and can have optional gaps inside.
+   *
+   * @param text The text to analyze.
+   * @param analyzer The {@link Analyzer} to use to acquire a {@link TokenStream} which is then
+   *     converted into intervals.
+   * @param field The field {@code text} should be parsed as.
+   * @param maxGaps Maximum number of allowed gaps between sub-intervals resulting from tokens.
+   * @param ordered Whether sub-intervals should enforce token ordering or not.
+   * @return Returns an {@link IntervalsSource} that matches tokens acquired from analysis of {@code
+   *     text}. Possibly an empty interval source, never {@code null}.
+   * @throws IOException If an I/O exception occurs.
+   */
+  public static IntervalsSource analyzedText(
+      String text, Analyzer analyzer, String field, int maxGaps, boolean ordered)
+      throws IOException {
+    try (TokenStream ts = analyzer.tokenStream(field, text)) {
+      return analyzedText(ts, maxGaps, ordered);
+    }
+  }
+
+  /**
+   * Returns intervals that correspond to tokens from the provided {@link CachingTokenFilter}. This
+   * is a low-level counterpart to {@link #analyzedText(String, Analyzer, String, int, boolean)}.
+   * The intervals can be ordered or unordered and can have optional gaps inside.
+   *
+   * @param tokenStream The token stream to produce intervals for. The token stream may be fully or
+   *     partially consumed after returning from this method.
+   * @param maxGaps Maximum number of allowed gaps between sub-intervals resulting from tokens.
+   * @param ordered Whether sub-intervals should enforce token ordering or not.
+   * @return Returns an {@link IntervalsSource} that matches tokens acquired from analysis of {@code
+   *     text}. Possibly an empty interval source, never {@code null}.
+   * @throws IOException If an I/O exception occurs.
+   */
+  public static IntervalsSource analyzedText(TokenStream tokenStream, int maxGaps, boolean ordered)
+      throws IOException {
+    CachingTokenFilter stream =
+        tokenStream instanceof CachingTokenFilter
+            ? (CachingTokenFilter) tokenStream
+            : new CachingTokenFilter(tokenStream);
+
+    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
+    PositionIncrementAttribute posIncAtt = stream.addAttribute(PositionIncrementAttribute.class);
+    PositionLengthAttribute posLenAtt = stream.addAttribute(PositionLengthAttribute.class);
+
+    if (termAtt == null) {
+      return NO_INTERVALS;
+    }
+
+    // Phase 1: read through the stream and assess the situation:
+    // counting the number of tokens/positions and marking if we have any synonyms.
+
+    int numTokens = 0;
+    boolean hasSynonyms = false;
+    boolean isGraph = false;
+
+    stream.reset();
+    while (stream.incrementToken()) {

Review comment:
       *NULL_DEREFERENCE:*  object `stream.iterator` last assigned on line 489 could be null and is dereferenced by call to `incrementToken()` at line 507.
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
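
The warning concerns the `iterator` field, which is null right after the filter is constructed. Whether the flagged dereference is actually reachable depends on whether `incrementToken()` assigns that field before reading it. The sketch below is a simplified, hypothetical stand-in (`LazyCachingStream` and `countTokens` are illustrative names, not Lucene classes or methods) for a filter that fills its cache lazily on the first `incrementToken()` call. It assumes `CachingTokenFilter` follows a similar lazy-fill pattern; that should be checked against the Lucene source before the warning is ignored.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a caching token filter; NOT Lucene's CachingTokenFilter. */
public class LazyCachingStream {
  private final List<String> source;
  private List<String> cache;        // null until the first incrementToken() call
  private Iterator<String> iterator; // null until the cache has been filled

  public LazyCachingStream(List<String> source) {
    this.source = source;
  }

  /** Rewinds to the start of the cache if it exists; otherwise leaves iterator null. */
  public void reset() {
    if (cache != null) {
      iterator = cache.iterator();
    }
  }

  /** Fills the cache on first use, so iterator is always assigned before it is read. */
  public boolean incrementToken() {
    if (cache == null) {
      cache = new ArrayList<>(source);
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false;
    }
    iterator.next();
    return true;
  }

  /** Mirrors the reset-then-loop shape of "Phase 1" in the diff above. */
  public static int countTokens(LazyCachingStream stream) {
    stream.reset();
    int numTokens = 0;
    while (stream.incrementToken()) {
      numTokens++;
    }
    return numTokens;
  }

  public static void main(String[] args) {
    LazyCachingStream stream = new LazyCachingStream(List.of("quick", "brown", "fox"));
    System.out.println(countTokens(stream)); // first pass fills the cache
    System.out.println(countTokens(stream)); // second pass replays the cache
  }
}
```

Under this assumption, calling `reset()` on a fresh instance leaves `iterator` null, but the first `incrementToken()` assigns it before any dereference, so the loop is safe on both the first and later passes.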




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


