You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/01 18:34:52 UTC

[GitHub] [lucene] kotman12 opened a new pull request, #11734: Fix repeating token sentence boundary bug

kotman12 opened a new pull request, #11734:
URL: https://github.com/apache/lucene/pull/11734

   Fix sentence boundary detection bug in case of repeating tokens (i.e. while using OpenNLP analysis chain in conjunction with a KeywordRepeatFilter) by keeping track of the sentence index and looking ahead one token. Move inner sentence iteration to a component to be shared by the sentence-aware OpenNLP filters.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254627495

   I don't know what happened there but I'm sure it's going to be fixable. Let me take a look later today or tomorrow morning (I'm out of office today).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1252099555

   Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part of the codebase but I think I see what's happening and how you managed to fix it. Looks good to me. It'd be good to run this pipeline over a larger text base to make sure there are no surprises and regressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on a diff in pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on code in PR #11734:
URL: https://github.com/apache/lucene/pull/11734#discussion_r975127196


##########
lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java:
##########
@@ -290,4 +299,61 @@ public void testKeywordAttributeAwarenessDictionaryAndMaxEnt() throws Exception
         null,
         true);
   }
+
+  public void testNoBreakWithRepeatKeywordFilter() throws Exception {
+    CustomAnalyzer analyzer =
+        CustomAnalyzer.builder(new ClasspathResourceLoader(getClass()))
+            .withTokenizer(
+                "opennlp", "tokenizerModel", tokenizerModelFile, "sentenceModel", sentenceModelFile)
+            .addTokenFilter("opennlpPOS", "posTaggerModel", "en-test-pos-maxent.bin")
+            .addTokenFilter(KeywordRepeatFilterFactory.class)
+            .addTokenFilter("opennlplemmatizer", "dictionary", "en-test-lemmas.dict")
+            .build();
+    assertAnalyzesTo(
+        analyzer,
+        NO_BREAK_SINGLE_TOKEN_REPEAT_KEYWORD,
+        NO_BREAK_SINGLE_TOKEN_REPEAT_KEYWORD_terms,
+        null,
+        null,
+        null,
+        null,
+        null,
+        true);
+  }
+
+  public void testPreventEarlyExit() throws IOException {
+    ClasspathResourceLoader loader = new ClasspathResourceLoader(getClass());
+    InputStream earlyExitInput = loader.openResource("data/early-exit-bug-input.txt");

Review Comment:
   Can we add a try-with-finally around those opened resource streams? 



##########
lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/SentenceAttributeExtractor.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.opennlp;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.SentenceAttribute;
+import org.apache.lucene.util.AttributeSource;
+
+/**
+ * Iterate through sentence tokens and cache their attributes. Could consider moving this to a more
+ * central location to be used by other sentence-aware components.
+ */
+public class SentenceAttributeExtractor {
+
+  private final TokenStream input;
+  private final SentenceAttribute sentenceAtt;
+  private final List<AttributeSource> sentenceTokenAttrs = new ArrayList<>();
+  private AttributeSource prevAttributeSource;
+  private int currSentence = 0;
+  private boolean hasNextToken = true;
+
+  public SentenceAttributeExtractor(TokenStream input, SentenceAttribute sentenceAtt) {
+    this.input = input;
+    this.sentenceAtt = sentenceAtt;
+  }
+
+  public List<AttributeSource> extractSentenceAttributes() throws IOException {
+    sentenceTokenAttrs.clear();
+    boolean hasNext;
+    do {
+      hasNextToken = input.incrementToken();
+      int currSentenceTmp = sentenceAtt.getSentenceIndex();
+      hasNext = currSentence == currSentenceTmp && hasNextToken;

Review Comment:
   ```suggestion
         hasNext = (currSentence == currSentenceTmp && hasNextToken);
   ```
   Makes it clearer to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256090394

   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256263228

   Thanks as well for taking a look 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234648586

   Linking issue #11735 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss merged pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss merged PR #11734:
URL: https://github.com/apache/lucene/pull/11734


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256055653

   I rebased your commits on top of main so that they're linear when merged. Waiting for builds to pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234911661

   ./gradlew check passed locally as described in the contribution guide 😃 
   ![image](https://user-images.githubusercontent.com/13710476/188029979-f0d271ce-5999-46fc-a585-aa3e6ae9c287.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1253843509

> Hi @kotman12 . Sorry for the delay. I'm not that familiar with this part of the codebase but I think I see what's happening and how you managed to fix it. Looks good to me. It'd be good to run this pipeline over a larger text base to make sure there are no surprises and regressions.

Hi @dweiss .. no worries, thanks for taking a look.

I have made the changes you suggested. I agree that it would be good to test this with a large and more varied text base. I did incorporate these changes separately as a patch in my current project which is how I found that my initial proposal actually had a regression (it threw an NPE in case of empty input). It's possible there other bugs that I haven't uncovered .. did you have any particular test dataset in mind?

-------------------------------------------------------------------------------------------------------------------------------------------
Totally separately I noticed that in the case of stacking two or more sentence-aware token filters, sentence token iteration and attribute cloning are performed separately across the different filters (this is the case in the current implementation as well). This work could probably be done in a dedicated `SentenceExtractionFilter` which could then pass a reference to a _single_ `sentenceTokenAttributes` to all downstream filters that rely on sentence-level analysis (perhaps such a reference could be stashed in the proposed `SentenceAttribute`). Also, following this reasoning a bit further, it is possible to conceive future implementations that rely on passing an arbitrary subset of adjacent tokens together to some analysis function, in which case the use of the term `Sentence` would become a misnomer.

However, addressing these issues would be more effort so I wanted to check if it is even worth it to explore. I am starting to suspect that this package is not really used a lot because otherwise one would expect these bugs to have been caught sooner given the configuration is in the documentation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254019222

   > I am starting to suspect that this package is not really used a lot because otherwise one would expect these bugs to have been caught sooner given the configuration is in the documentation.
   
   You're very like right. There's a lot of old cruft that's mostly unused lying around. Please add lucene/CHANGES.txt entry (under 9.5.0) and I think it's done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

Posted by GitBox <gi...@apache.org>.

kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254253617

   @dweiss I updated CHANGES.txt but blew up this PR and messed up the history in the process. If you prefer this is a more concise PR with the relevant changes patched in -> https://github.com/apache/lucene/pull/11802


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org