You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2019/12/25 16:03:54 UTC

[GitHub] [lucene-solr] Traktormaster opened a new pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Traktormaster opened a new pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123
 
 
   # Description
   
   This is an improved version of the patch submitted to the Jira issue about a week ago. A lot of information and discussion was posted on Jira. Here are my relevant comments:
   [Initial comment about possible solution](https://issues.apache.org/jira/browse/LUCENE-9093?focusedCommentId=16995738&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16995738)
   [Follow-up with more details](https://issues.apache.org/jira/browse/LUCENE-9093?focusedCommentId=16996736&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16996736)
   
   # Solution
   [Submitted patch and some notes on the implementation](https://issues.apache.org/jira/browse/LUCENE-9093?focusedCommentId=16998660&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16998660)
   
   This pull request is made so that the new parameter's default value is backward-compatible when merging to the 8.x and 7.x branches.
   @dsmiley suggested that for the 9.x (master) branch this should be changed in a second PR. In this 2nd phase I'll create another PR here with a `hl.fragalign` of `0.5` and updated docs. Regarding this 2nd PR I'd also suggest to change the default `hl.bs.type` to `WORD` as the new behavior will then be similar to the other highlighters' default behavior.
   
   # Tests
   All related tests were updated and extended with a few new assertions to check the changes. I wrote one more comment clarifying how the changes work with different BreakIterator types:
   [Clarification on how SENTENCE BI will behave](https://issues.apache.org/jira/browse/LUCENE-9093?focusedCommentId=17002548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17002548)
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `ant precommit` and the appropriate test suite.
   - [x] I have added tests for my changes.
   - [x] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361492729
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -41,6 +43,31 @@
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
 
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      try {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+        Assert.fail("Expected IllegalArgumentException for "+alignment);
+      } catch (IllegalArgumentException e) {
+        if (!e.getMessage().contains("fragmentAlignment")) {
+          throw e;
+        }
+      }
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
 
 Review comment:
   I wasn't sure to write those or not. Are they an overkill?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#issuecomment-569509670
 
 
   I took your advice and made the LGBI "cache" the current index to avoid having to "fix" it on the underlying BI. I've also reevaluated the handling of DONE results.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361830391
 
 

 ##########
 File path: solr/solrj/src/java/org/apache/solr/common/params/HighlightParams.java
 ##########
 @@ -47,6 +47,8 @@
 
   // sizing
   public static final String FRAGSIZE    = HIGHLIGHT+".fragsize"; // OH, FVH, UH
+  public static final String FRAGALIGNRATIO = HIGHLIGHT+".fragAlignRatio"; // UH
+  public static final String FRAGSIZEISMINIMUM = HIGHLIGHT+".fragsizeIsMinimum"; // UH
 
 Review comment:
   very minor: I'd prefer these two added rows are switched so that fragsizeIsMinimum directly follows fragsize

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361471554
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -44,17 +44,30 @@ public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, in
     return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
   }
 
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
 
 Review comment:
   With \@Deprecated

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361853606
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -167,68 +146,70 @@ public int previous() {
 
   @Override
   public int following(int matchEndIndex) {
-    final int targetIdx = (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment));
+    return following(matchEndIndex, (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment)));
+  }
+
+  private int following(int matchEndIndex, int targetIdx) {
     if (targetIdx >= getText().getEndIndex()) {
-      return baseIter.last();
+      if (currentCache == baseIter.last()) {
+        return DONE;
+      }
+      return currentCache = baseIter.last();
     }
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx == DONE) {
-      return baseIter.current();
+      currentCache = baseIter.last();
+      return DONE;
     }
     if (afterIdx == targetIdx) { // right on the money
-      return afterIdx;
+      return currentCache = afterIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return afterIdx;
+      return currentCache = afterIdx;
     }
 
     // note: it is a shame that we invoke preceding() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int beforeIdx = baseIter.preceding(targetIdx);
     if (targetIdx - beforeIdx < afterIdx - targetIdx && beforeIdx > matchEndIndex) {
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
-    return afterIdx;
-  }
-
-  private int moveToBreak(int idx) { // precondition: idx is a known break
 
 Review comment:
   Yay; moveToBreak be gone

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361750286
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   > hl.closestTo ... I wish it simply always worked this way...
   
   I'm guessing whoever wrote the original UH decided against it for performance reasons. There is a  [comment](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java#L149) to suggest that. Also note that depending on which index is closer the BI may had to invoke `following()` one extra time inside `moveToBreak()`. I think `moveToBreak()` is only necessary when a default-summary is being made and I can remove it from the code when a match is being highlighted.
   
   The performance hit would be depending on the fragment's length that contains the goal index. In WORD mode I'd hope this is negligible. It will necessarily be worse for SENTENCE.
   
   You've brought up a new point to keep in mind: simplifying the options. I'll be back with updated code that we can discuss further.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#issuecomment-569492072
 
 
   Most of your review points this round are about the correctness of the BreakIterator API/class impl. Yet the comment where the overlap handling is discussed, you would move that logic inside the LGBI which would conflict with the original intention of what the `preceding()` method should implement. (it's not supposed to only work after the current index)
   
   So far I've not made an attempt to make the behavior of this class "consistent" as it is very unclear to what I would want it to be consistent to. The exclusively one place it is ever used would not benefit from more consistency. It is explicitly stated in the class docstring: `Important: This is not a general purpose {@link BreakIterator}; it's only designed to work in a way compatible with the {@link UnifiedHighlighter}.  Some assumptions are checked with Java assertions.` Some methods are straight up unimplemented with an `assert false`.
   
   Basically I can make it behave more consistently, but there's no benefit right now. That's why I haven't done it so already. I guess you're thinking that in the future when someone will work on this again, it would be better to have a little more well rounded implementation. I can respect that.
   
   However in light of what we're going for now, I do not agree to move the `Math.max(..., lastPassageEnd)` functionality into the LGBI. (It would not have a benefit either way IMO)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361745552
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   I just want to confirm: your suggestion will be the code you already wrote above but have not yet pushed and that breaks that one test.  Yes. we can just change that test slightly.
   
   One thing that vexes me a bit about highlighting is that there are so many choices and options to name -- configuration-itis.  Good defaults help.  Also simply making a reasonable choice for 98% of users and letting the 2% do something themselves is also better than adding yet another tuning nob.  I'm not sure where the things we are discussing lie.  (Oh and deprecate/removing the other highlighters would help too ;-)  Any way...
   
   > hl.fragalign
   
   Since you propose the default be 0.5 then in what scenario can you imagine someone would specify it?  I suggest `hl.fragAlignRatio` as a better name if we keep this.
   
   > hl.closestTo
   
   I wish it simply always worked this way (`true`), to be honest.  If I had to give this option a name, it'd definitely have "fragsize" in its name to link it's meaning to the other parameter.  Perhaps `hl.fragsizeIsMinimum` inverted meaning.  Sadly `hl.fragsize` is not camel cased; ah well.  
   
   Maybe these should be _secret_ options with code comments about filing a JIRA issue if you want to change them :-).  After all... I'm not sure _all_ options need to be documented.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361471705
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -44,17 +44,30 @@ public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, in
     return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
   }
 
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
+    return createMinLength(baseIter, minLength, 0.f);
+  }
+
   /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after)
    * is chosen. */
   public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength,
 
 Review comment:
   Please document fragmentAlignment

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361375534
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   couldn't fragmentStart argument be offset so that baseIter needn't re-analyze the text before offset?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361374225
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -33,23 +33,32 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent possible). */
-  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
-    return new LengthGoalBreakIterator(baseIter, minLength, true);
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength,
 
 Review comment:
   I propose adding a deprecated version of the current method with 0f for back-compat.   Solr won't use it but others out there might.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361838528
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -104,10 +126,37 @@ public int next(int n) {
     return baseIter.next(n); // probably wrong
   }
 
-  // called by getSummaryPassagesNoHighlight to generate default summary.
+  // Called by getSummaryPassagesNoHighlight to generate default summary.
+  // This is the same implementation that following() has, except:
 
 Review comment:
   I'll find a workaround to reduce the duplication.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -39,65 +41,142 @@
   //                      0         1
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      expectThrows(IllegalArgumentException.class, () -> {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+      });
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+    backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
     // "goal" means target length goal to find closest break
 
     // at first word:
     Query query = query("aa");
-    assertEquals("almost two sent",
-        "<b>Aa</b> bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   **EDIT**: NVM this, I've checked this and it was never a problem, but could be in the proposed version.
   I've not checked this but thinking about it makes me believe that previous to my changes this could have happened regardless, since the `following()` was called with `start` instead of `end - 1`. The reason it was not a problem could be that it would only happen at a ridiculously low `fragsize`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361827668
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd));
 
 Review comment:
   Oh wait; something occurred to me.  The breakIterator.preceding impl doesn't intrinsically know that FieldHighlighter is going to call `Math.max(..., lastPassageEnd)` on it.  And I recall you are adding this change here in FieldHighlighter because the updated LengthGoalBreakIterator might want to look further back to the left into a zone that might have been part of a previous Passage.  Maybe `LengthGoalBreakIterator.preceding` should examine `current()` at the start and ensure it doesn't yield a break before that.  Then FieldHighlighter wouldn't change.  Without this small proposal, the length of this passage will be undersized because LengthGoalBreakIterator doesn't know FieldHighlighter is going to chop off some of the beginning thanks to that `max()`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361472276
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -41,6 +43,31 @@
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
 
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      try {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+        Assert.fail("Expected IllegalArgumentException for "+alignment);
+      } catch (IllegalArgumentException e) {
 
 Review comment:
   See `LuceneTestCase.exceptThrows`
   
   BTW this is very thorough of you to test this

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8`
   - Match-fragment is the fragment that contains the match according to the underlying BI:
     - WORD: `vacation` because it breaks on space and period.
     - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)-fragsize)` (which is `54`).
     - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.`
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.5`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=1.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result.
   
   As you can see the `fragsize` is **not** about the final size of the snippet, but about the text around the match.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361828027
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -104,10 +126,37 @@ public int next(int n) {
     return baseIter.next(n); // probably wrong
   }
 
-  // called by getSummaryPassagesNoHighlight to generate default summary.
+  // Called by getSummaryPassagesNoHighlight to generate default summary.
+  // This is the same implementation that following() has, except:
 
 Review comment:
   This is a lot of code to duplicate; lets not do this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361828632
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
 
   // called at start of new Passage given first word start offset
   @Override
-  public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+  public int preceding(int matchStartIndex) {
+    final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
+    if (targetIdx <= 0) {
+      return 0;
 
 Review comment:
   Note that you didn't set the state on the underlying BI for this case.  But we should do that for correctness?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r362016176
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -60,12 +83,15 @@ private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, boolean
   @Override
   public String toString() {
     String goalDesc = isMinimumLength ? "minLen" : "targetLen";
-    return getClass().getSimpleName() + "{" + goalDesc + "=" + lengthGoal + ", baseIter=" + baseIter + "}";
+    return getClass().getSimpleName() + "{" + goalDesc + "=" + lengthGoal + ", fragAlign=" + fragmentAlignment +
+        ", baseIter=" + baseIter + ", currentCache=" + currentCache + "}";
 
 Review comment:
   I'm going to remove currentCache from the toString when I commit because I don't think it belongs.  It's an internal detail.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361853588
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -167,68 +146,70 @@ public int previous() {
 
   @Override
   public int following(int matchEndIndex) {
-    final int targetIdx = (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment));
+    return following(matchEndIndex, (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment)));
+  }
+
+  private int following(int matchEndIndex, int targetIdx) {
     if (targetIdx >= getText().getEndIndex()) {
-      return baseIter.last();
+      if (currentCache == baseIter.last()) {
+        return DONE;
+      }
+      return currentCache = baseIter.last();
     }
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx == DONE) {
-      return baseIter.current();
+      currentCache = baseIter.last();
+      return DONE;
     }
     if (afterIdx == targetIdx) { // right on the money
-      return afterIdx;
+      return currentCache = afterIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return afterIdx;
+      return currentCache = afterIdx;
     }
 
     // note: it is a shame that we invoke preceding() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int beforeIdx = baseIter.preceding(targetIdx);
     if (targetIdx - beforeIdx < afterIdx - targetIdx && beforeIdx > matchEndIndex) {
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
-    return afterIdx;
-  }
-
-  private int moveToBreak(int idx) { // precondition: idx is a known break
-    // bi.isBoundary(idx) has side-effect of moving the position.  Not obvious!
-    //boolean moved = baseIter.isBoundary(idx); // probably not particularly expensive
-    //assert moved && current() == idx;
-
-    // TODO fix: Would prefer to do "- 1" instead of "- 2" but CustomSeparatorBreakIterator has a bug.
-    int current = baseIter.following(idx - 2);
-    assert current == idx : "following() didn't move us to the expected index.";
-    return idx;
+    // moveToBreak is necessary for when getSummaryPassagesNoHighlight calls next and current() is used
+    return currentCache = afterIdx;
   }
 
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int matchStartIndex) {
     final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
     if (targetIdx <= 0) {
-      return 0;
+      if (currentCache == baseIter.first()) {
+        return DONE;
+      }
+      return currentCache = baseIter.first();
     }
     final int beforeIdx = baseIter.preceding(targetIdx + 1);
     if (beforeIdx == DONE) {
-      return 0;
+      currentCache = baseIter.first();
+      return DONE;
     }
     if (beforeIdx == targetIdx) { // right on the money
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
 
     // note: it is a shame that we invoke following() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx - targetIdx < targetIdx - beforeIdx && afterIdx < matchStartIndex) {
-      return afterIdx;
+      return currentCache = afterIdx;
     }
-    return beforeIdx;
+    // moveToBreak is for consistency
 
 Review comment:
   no longer need to reference moveToBreak

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840489
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -39,65 +41,142 @@
   //                      0         1
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      expectThrows(IllegalArgumentException.class, () -> {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+      });
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+    backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
     // "goal" means target length goal to find closest break
 
     // at first word:
     Query query = query("aa");
-    assertEquals("almost two sent",
-        "<b>Aa</b> bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   I've already thought about trying that, but there's one case I think could break and I didn't see it worthwhile to investigate further.
   
   Since the index tokenizer and the highlight separator may work on completely different characters, the separator could break inside the match if I allow it to process the chars of the match itself. In that case with very low `fragsize` or very uneven `fragAlignRatio` a break inside the match could be accepted and the [first assertion](https://github.com/apache/lucene-solr/blob/3ae1a0b3bad84cdfaa3941b87a1a7fcad63a66d4/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java#L45) in `Passage.addMatch` would fail.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361829906
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -39,65 +41,142 @@
   //                      0         1
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      expectThrows(IllegalArgumentException.class, () -> {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+      });
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+    backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
     // "goal" means target length goal to find closest break
 
     // at first word:
     Query query = query("aa");
-    assertEquals("almost two sent",
-        "<b>Aa</b> bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   I looked closer at why these numbers changed by 2 and I see it's because FieldHighlighter now passes the match end offset for the `following` call.  As you've been telling me, the sizing tends to be larger by the match itself.  Hmmm.  Perhaps FieldHighlighter should calculate the mid-point and then use that for both calls.  The test might then be off by only 1 but still, the results would be closer to the intended sizing.  WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361853963
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -39,65 +41,142 @@
   //                      0         1
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      expectThrows(IllegalArgumentException.class, () -> {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+      });
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+    backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
     // "goal" means target length goal to find closest break
 
     // at first word:
     Query query = query("aa");
-    assertEquals("almost two sent",
-        "<b>Aa</b> bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   A min/max check in FieldHighlighter could correct for this concern.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361658697
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   When WORD mode is used, is there no change RE fragsize interpretation?  By "matchsize" do you mean the length of the matched words themselves?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361491499
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -33,23 +33,45 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent possible). */
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength,
+                                                        float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
   public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
-    return new LengthGoalBreakIterator(baseIter, minLength, true);
+    return createMinLength(baseIter, minLength, 0.f);
   }
 
   /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after)
    * is chosen. */
+  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength,
+                                                              float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, targetLength, fragmentAlignment, false);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
   public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength) {
-    return new LengthGoalBreakIterator(baseIter, targetLength, false);
+    return createClosestToLength(baseIter, targetLength, 0.f);
   }
 
-  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, boolean isMinimumLength) {
+  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, float fragmentAlignment,
+                                  boolean isMinimumLength) {
     this.baseIter = baseIter;
     this.lengthGoal = lengthGoal;
+    if (fragmentAlignment < 0.f || fragmentAlignment > 1.f || !Float.isFinite(fragmentAlignment)) {
+      throw new IllegalArgumentException("fragmentAlignment must be >= zero and <= one");
+    }
+    this.fragmentAlignment = Math.max(Math.min(fragmentAlignment, 1.f), 0.f);
 
 Review comment:
   I was too preoccupied with the testing, so I forgot about that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361480777
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
 
 Review comment:
   The "max" with 0 had defended against the DONE scenario.  I'm honestly not sure if DONE is possible.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361374902
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -116,28 +128,20 @@ public int previous() {
     return baseIter.previous();
   }
 
-  // called while the current position is the start of a new passage; find end of passage
+  // NOTE: this.fragmentEndFromPreceding is used instead of the parameter!
+  // This is a big diversion from the API a BreakIterator should implement, but specifically this optimization is fine.
   @Override
   public int following(int followingIdx) {
 
 Review comment:
   based on your comment, rename this parameter to _unused_followingIdx

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on issue #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#issuecomment-569408594
 
 
   Basically the whole patch has been reworked, here are the highlights:
   - About half as many calls to `preceding()` and `following()` on the wrapped BI.
   - `fragsize` is about the length of the contextual text around the match. (it's relation to the snippet size is indirect)
   - `fragalign` has been renamed to `fragAlignRatio`. Changed it's default value to `0.5`.
   - Made the hidden `closestTo/targetLen` mode the default instead of `minimumLen`. This adds an automatic slop-like behaviour.
   - The new parameter `fragsizeIsMinimum` can be used to switch back to the previously used mode. (minimum mode is a little faster)
   - Changed the call of `following(start)` in `FieldHighlighter` to `following(end - 1)` for performance and correctness.
   - Tests and docs have been updated.
   
   With these defaults the UH will give more meaingful results as-is. To have even more similar results to the older highlighters the `hl.bs.type` can be switched to `WORD`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361432710
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   A little correction: the **offset** parameter is the start of the match and the fragmentStart could be different because of the issues described above.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361471274
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   I understand what `offset` param is and what `fragmentStart` is, and that the boundaries of each are in different universes that may or may not align.  But I don't see why that matters here.  I claim the call to `baseIter.following(fragmentStart)` is guaranteed to be equal to `baseIter.following(offset)` _because_ fragmentStart is `baseIter.preceding(offset)`.  The argument to the breakIterator can be any arbitrary character offset; it needn't be an existing break of any kind.  FWIW I made the simple correction and tests pass.  If you insist there is a problem then can you present a failing test?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361426082
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   Unfortunately no. The fragmentStart argument is the start of the match that could be anything depending on the tokenizer in the index analyzer chain. Even if we assume it's the start of a word or a phrase, the underlying BI can break on different places. In case of SENTENCE the preceding() call here will find the beginning of the sentence. In case of SEPARATOR, which is customizable by query, the breaks can be anywhere else.
   We could only assume fragmentStart is a break point if the underlying BI would be the same as the tokenizer in the index analyzer chain. (I'm not sure, but the query analyzer chain could be different I think.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361376198
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   A point I want to make is that BreakIterators can be rather expensive and so it's good to try to avoid calling it more than needed, and to help it do less work.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361535945
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -41,6 +43,31 @@
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
 
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      try {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+        Assert.fail("Expected IllegalArgumentException for "+alignment);
+      } catch (IllegalArgumentException e) {
+        if (!e.getMessage().contains("fragmentAlignment")) {
+          throw e;
+        }
+      }
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
 
 Review comment:
   It's fine.  I can't say I would have bothered but you did it and the test has value.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361655197
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   Correction: the snippet size is not `fragsize+matchsize`, but `min(sentence_left_of_match, fragsize*fragalign) + matchsize + min(sentence_right_of_match, fragsize*(1-fragalign))`.
   So if I want a left aligned highlight in SENTENCE mode, but the match is on the right end of the sentence, the whole `fragsize` would be added on the right.
   This is less accurate if you're concerned about the result snippet's size, but more accurate if you're concerned about the alignment of the match.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361486777
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   I'm noticing that the logic here for fragment alignment doesn't seem to care whatsoever about where "offset" is within it's segment.  But isn't that super relevant?  As I consider a BI on SENTENCE based segments, I think it is... but I can see how you overlooked this when focusing on WORD scenarios.  For example assuming 0.5 fragmentAlignment, and if the "offset" (the match) happens to occur at the right end of the segment, and lets say the length goal is only 10 chars larger than centerLength, then shouldn't we expand to the right and not the left?
   
   You're going to hate me for this but let me make a proposal :-)  What if we multiply fragmentAlignment by lengthGoal and interpret this as a minimum number of characters wanted to the left of the start of the match (`offset`).  The difference of that with lengthGoal indicates minimum chars wanted to the right.  We use the delegate BI to find the passage start, and then we can consult fragmentAlignment with where `offset` is relative to the start to decide how much text to the right of `offset` we want.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361483326
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
 
 Review comment:
   All this is inherently confusing so lets add a bit more comments or reconsider some variable names.  It would help me understand "centerLength" with a comment above saying something like "centerLength is the length of the center-most segment.".  Or maybe "span" and not segment; too bad there isn't a clear word for these.  Could be a sentence, word, or something else; its whatever the BI is breaking.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361539250
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   I'm glad you're cool with entertaining reworking the whole thing :-)  BTW my proposal is rather incomplete or imperfect or both.  We probably won't get a perfect lengthGoal hit (hence "goal" in the name).  And we almost certainly won't get a perfect fragmentAlignment ratio with before/after -- it should perhaps also have "goal" in the name.  So whatever the solution is will have some inherent fuzziness to it.  This is probably all obvious to you but I just want to state I'm aware you can poke lots of holes in my proposal.  I just think we can do a bit better than your original proposal by using some more information.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361541004
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   BTW see Solr's "LuceneRegexFragmenter" which has a "slop" notion and may have ideas worth looking at.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361650672
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   I have a working version with this new behavior. There is just a single assertion that fails [here](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/test/org/apache/solr/highlight/TestUnifiedSolrHighlighter.java#L267). It's because the `fragsize=60` there means to try and get 60 characters on the right side of the match as the default `fragalign` is zero. To pass the test the query parameters need to be changed to have `hl.fragsize=50&hl.fragalign=0.5`.
   This is not a problem by itself, but there is no way to make the new behavior completely backwards-compatible like the previous one.
   
   It could also be a bit jarring to have `fragsize` with a distinctively different meaning compared to the other highlighters:
   - For original&fastVector the `fragsize` is the target size of the snippet.
   - For unified the `fragsize` is the size of the contextual text around the match. Ie.: the snippet's target size is `fragsize+matchsize`.
   
   I'm still in favor of this change, there's literally half as many calls to `preceding()` and `following()` of the underlying BI and it's much clearer as well.
   
   Should I change the failing test and push this for review? What about the backward-compatibility? I'm not sure how to change the docs or the default parameters.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361704668
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   Maybe one problem with this is that it wouldn't be quite as ideal for WORD, which would in practice always put the word at the 1/3rd and not half way.  This generic wrapping BI doesn't know on average how big the underlying BI will tend to be (SENTENCE vs WORD).  I suppose that could be addressed by making the "slop" configurable.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361828673
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
 
   // called at start of new Passage given first word start offset
   @Override
-  public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+  public int preceding(int matchStartIndex) {
+    final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
+    if (targetIdx <= 0) {
+      return 0;
+    }
+    final int beforeIdx = baseIter.preceding(targetIdx + 1);
+    if (beforeIdx == DONE) {
+      return 0;
+    }
+    if (beforeIdx == targetIdx) { // right on the money
+      return beforeIdx;
+    }
+    if (isMinimumLength) { // thus never undershoot
+      return beforeIdx;
+    }
+
+    // note: it is a shame that we invoke following() *one more time*; BI's are sometimes expensive.
+
+    // Find closest break to target
+    final int afterIdx = baseIter.following(targetIdx - 1);
+    if (afterIdx - targetIdx < targetIdx - beforeIdx && afterIdx < matchStartIndex) {
+      return afterIdx;
+    }
+    return beforeIdx;
 
 Review comment:
   No moveToBreak and so the underlying BI here is not consistent.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361482118
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -33,23 +33,45 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent possible). */
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength,
+                                                        float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
   public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
-    return new LengthGoalBreakIterator(baseIter, minLength, true);
+    return createMinLength(baseIter, minLength, 0.f);
   }
 
   /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after)
    * is chosen. */
+  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength,
+                                                              float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, targetLength, fragmentAlignment, false);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */
   public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength) {
-    return new LengthGoalBreakIterator(baseIter, targetLength, false);
+    return createClosestToLength(baseIter, targetLength, 0.f);
   }
 
-  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, boolean isMinimumLength) {
+  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, float fragmentAlignment,
+                                  boolean isMinimumLength) {
     this.baseIter = baseIter;
     this.lengthGoal = lengthGoal;
+    if (fragmentAlignment < 0.f || fragmentAlignment > 1.f || !Float.isFinite(fragmentAlignment)) {
+      throw new IllegalArgumentException("fragmentAlignment must be >= zero and <= one");
+    }
+    this.fragmentAlignment = Math.max(Math.min(fragmentAlignment, 1.f), 0.f);
 
 Review comment:
   Why the max & min at this point?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -39,65 +41,142 @@
   //                      0         1
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      expectThrows(IllegalArgumentException.class, () -> {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+      });
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+    backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString();
+    assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
     // "goal" means target length goal to find closest break
 
     // at first word:
     Query query = query("aa");
-    assertEquals("almost two sent",
-        "<b>Aa</b> bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   I've not checked this but thinking about it makes me believe that previous to my changes this could have happened regardless, since the `following()` was called with `start` instead of `end - 1`. The reason it was not a problem could be that it would only happen at a ridiculously low `fragsize`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd));
+        lastPassageEnd = Math.min(this.breakIterator.following(start), contentLength);
 
 Review comment:
   Changing the `following(start)` to `following(end - 1)` would have been a great idea before this changeset. However these two calls on the `LengthGoalBreakIterator` are very specifically crafted. The `preceding()` prepares state data that `following()` will use. Currently the workflow is:
   1. The call to `preceding(start + 1)` will identify the `startIdx` and `endIdx` around the match by the underlying BI. This `startIdx` can be earlier than the start of the match. Similarly this `endIdx` can be later than the end of the match. (this is the usual case for SENTENCE BI) Both are needed at this point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the remaining size to give as context around the first fragment. Then `context_size*fragalign` is the length `preceding()` will try to prepend to the fragment. Finally it will return the final start position.
   2. The call to `following(start)` ignores the argument because the end of the match's fragment (`endIdx`) was already found in `preceding()`. `following()` will try to append the missing length also prepared by `preceding()`. Then return the final end position.
   
   This is what I was referring to in my comment when I submitted the first version of the patch. Ideally there was a single method on the `LengthGoalBreakIterator` that would receive the start **and** end positions of the match to work with. That way the processing would be more optimal. The underlying BI would not need to iterate over the content from start to end when searching for the `endIdx`. It could start from `end - 1` of the match.
   
   I have tried to change up this part of the code more, but a whole bunch of tests broke. I think there is some other BreakIterator subclass that gets passed around here. So this part can't be rewritten to cater to this specific behavior without greater changes.
   
   Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the `preceding()` call can go backwards and create overlaps with the previous passage.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361847732
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
 
   // called at start of new Passage given first word start offset
   @Override
-  public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+  public int preceding(int matchStartIndex) {
+    final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
+    if (targetIdx <= 0) {
+      return 0;
+    }
+    final int beforeIdx = baseIter.preceding(targetIdx + 1);
+    if (beforeIdx == DONE) {
+      return 0;
 
 Review comment:
   This again depends on what we want to accomplish. The original and correct behavior of a BreakIterator would be to return the DONE. Since I'm working on making the LGBI more correct, should I change these? Although `FieldHighlighter` will have to check for DONE if I remove these checks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361701592
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   Thanks for the detailed example.
   
   > As you can see the fragsize is not about the final size of the snippet, but about the text around the match.
   
   I think it's a bit strong to claim fragsize is _not_ about the final size of the snippet.  There is a linear relationship.  Decreasing this reduces snippet sizes and increasing it will increase snippet sizes.  It's just rather indirect because we are prioritizing the alignment.  And perhaps we overshoot too much.
   
   Also; couldn't we subtract out the match length itself if it's unduly influencing the final snippet size?
   
   Lets pretend we're at a more perfect future and not concerning ourselves with backwards compatibility.  Realistically, do you imagine a user would want to use fragalign of anything other than 0.5?  I doubt it; I think we should assume users want this.  Advanced users can choose _not_ to use this BreakIterator -- there are others plus they can write their own (which I've done).  Also, rather than assuming users need 50% of fragsize (lengthGoal) on both sides, maybe assume 60% _of that_ (I'm looking at default slop in LuceneRegexFragmenter which is similar), so thus ~0.3 (or perhaps 1/3rd) fragsize minimum left & right.  I think this will lead to snippets that are both closer to the desired fragsize on average and lead to more balanced passages than we have today.  LMK what you think.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361480642
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   AH I see! I somehow misinterpreted your question. You're mostly correct. The `baseIter.following(fragmentStart)` should be `baseIter.following(offset - 1)`. The minus one is needed for the very unlikely edge case where the match is a single character and the user requests the absolute minimal length for highlight. I added a little test that checks for that, you can see for yourself.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361374322
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -33,23 +33,32 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent possible). */
-  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
-    return new LengthGoalBreakIterator(baseIter, minLength, true);
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength,
+                                                        float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
   }
 
   /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after)
    * is chosen. */
-  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength) {
-    return new LengthGoalBreakIterator(baseIter, targetLength, false);
+  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength,
 
 Review comment:
   Please add a deprecated variant of the old signature.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361472624
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -44,17 +44,30 @@ public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, in
     return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
 
 Review comment:
   super minor: please add a space after the comma

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r362015879
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -167,68 +146,70 @@ public int previous() {
 
   @Override
   public int following(int matchEndIndex) {
-    final int targetIdx = (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment));
+    return following(matchEndIndex, (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment)));
+  }
+
+  private int following(int matchEndIndex, int targetIdx) {
     if (targetIdx >= getText().getEndIndex()) {
-      return baseIter.last();
+      if (currentCache == baseIter.last()) {
+        return DONE;
+      }
+      return currentCache = baseIter.last();
     }
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx == DONE) {
-      return baseIter.current();
+      currentCache = baseIter.last();
+      return DONE;
     }
     if (afterIdx == targetIdx) { // right on the money
-      return afterIdx;
+      return currentCache = afterIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return afterIdx;
+      return currentCache = afterIdx;
     }
 
     // note: it is a shame that we invoke preceding() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int beforeIdx = baseIter.preceding(targetIdx);
     if (targetIdx - beforeIdx < afterIdx - targetIdx && beforeIdx > matchEndIndex) {
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
-    return afterIdx;
-  }
-
-  private int moveToBreak(int idx) { // precondition: idx is a known break
-    // bi.isBoundary(idx) has side-effect of moving the position.  Not obvious!
-    //boolean moved = baseIter.isBoundary(idx); // probably not particularly expensive
-    //assert moved && current() == idx;
-
-    // TODO fix: Would prefer to do "- 1" instead of "- 2" but CustomSeparatorBreakIterator has a bug.
-    int current = baseIter.following(idx - 2);
-    assert current == idx : "following() didn't move us to the expected index.";
-    return idx;
+    // moveToBreak is necessary for when getSummaryPassagesNoHighlight calls next and current() is used
+    return currentCache = afterIdx;
   }
 
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int matchStartIndex) {
     final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
     if (targetIdx <= 0) {
-      return 0;
+      if (currentCache == baseIter.first()) {
+        return DONE;
+      }
+      return currentCache = baseIter.first();
     }
     final int beforeIdx = baseIter.preceding(targetIdx + 1);
     if (beforeIdx == DONE) {
-      return 0;
+      currentCache = baseIter.first();
+      return DONE;
     }
     if (beforeIdx == targetIdx) { // right on the money
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
 
     // note: it is a shame that we invoke following() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx - targetIdx < targetIdx - beforeIdx && afterIdx < matchStartIndex) {
-      return afterIdx;
+      return currentCache = afterIdx;
     }
-    return beforeIdx;
+    // moveToBreak is for consistency
 
 Review comment:
   I'll remove this comment on commit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   > Also; couldn't we subtract out the match length itself if it's unduly influencing the final snippet size?
   
   My original patch was so convoluted because that cannot be done easily in the current implementation with the BreakIterator base class. The call to `preceding()` on the LGBI receives the `matchStart` and returns the final start of the snippet. Since this method has no way to know the other end of the match it cannot adjust to the length of the match. Using private state the `following()` could know it, but that  would only be able to adjust on the right hand side and not the left.
   Something like `public int[] getPassageRange(int matchStart, int matchEnd) { ... }` would allow a better implementation to be written. But that requires changing `protected final BreakIterator breakIterator;` inside `FieldHighlighter` to a custom BI base that has such method.
   
   > Realistically, do you imagine a user would want to use fragalign of anything other than 0.5?
   
   I'd say the margin values of `fragalign` are of little use, but having the ability to set it in the range of`0.35-0.65` could be worthwhile.
   
   I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too beneficial.
   - The `RegexFragmenter` processes the whole stored text (until `hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`. Interestingly it also uses the tokenizer's offsets from the index (I think), because even if a break would be in the middle of a word it will not break it in half. Unless I'm completely missing something the regex seems to only give a strong recommendation to where a break should go.
   - Implementing an acceptable range (`slop`) would require to build some list of `hotSpots` or at least to look for them when extracting the fragment. For this we could only use `preceding()` and `following()` on the wrapped BI, which we wanted to avoid if possible for performance reasons.
   - For WORD BI the current solution is pretty accurate and is as optimal as it gets. (single call of `preceding()` and `following()` on the underlying BI)
   - For SENTENCE BI the slop would not help enough to be worth the trouble I think. Let's say I have to extend the snippet on the right side with 100 chars of a slop 0.4 (valid range: 60-140):
     - If the next sentence is 300 chars long there's nothing better to do what we already do.
     - If the next sentences have the lengths of [30, 50, 200, 40] the better choice would be to have the first two sentences appended with a sum length of 80. (that's only 20 off from the goal) For this the LGBI can be created with `closestTo` mode instead of  `minLength` mode. It would cost an extra call of `preceding()` and `following()` each but would produce the better result instead of appending 3 of the following sentences with 280 total length.
   
   So we already have a `slop` of sorts... it's just a boolean that cannot be turned on right now as it is missing a parameter. I've mentioned [this todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330) about it before.
   
   After looking though all of this, my suggestion is to:
   - keep `hl.fragalign` and have it be `0.5` by default
   - add `hl.closestTo` with default `false`
     - a better parameter name would be welcome
     - we could consider making it automatically turned on if the BI is SENTENCE
   - do not worry about keeping a backward-compatible behavior with this change

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361828618
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
 
   // called at start of new Passage given first word start offset
   @Override
-  public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+  public int preceding(int matchStartIndex) {
+    final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment);
+    if (targetIdx <= 0) {
+      return 0;
+    }
+    final int beforeIdx = baseIter.preceding(targetIdx + 1);
+    if (beforeIdx == DONE) {
+      return 0;
 
 Review comment:
   Returning beforeIdx (DONE) is perhaps more accurate/correct than 0?  Or maybe first()?  Perhaps doesn't matter; this is perhaps theoretical.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8`
   - Match-fragment is the fragment that contains the match according to the underlying BI:
     - WORD: `vacation` because it breaks on space and period.
     - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)*fragsize)` (which is `54`).
     - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.`
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.5`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=1.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result.
   
   As you can see the `fragsize` is **not** about the final size of the snippet, but about the text around the match.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361370892
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -33,23 +33,32 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent possible). */
-  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) {
-    return new LengthGoalBreakIterator(baseIter, minLength, true);
+  public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength,
+                                                        float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true);
   }
 
   /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after)
    * is chosen. */
-  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength) {
-    return new LengthGoalBreakIterator(baseIter, targetLength, false);
+  public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength,
+                                                              float fragmentAlignment) {
+    return new LengthGoalBreakIterator(baseIter, targetLength, fragmentAlignment, false);
   }
 
-  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, boolean isMinimumLength) {
+  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, float fragmentAlignment,
+                                  boolean isMinimumLength) {
     this.baseIter = baseIter;
     this.lengthGoal = lengthGoal;
+    this.fragmentAlignment = Math.max(Math.min(fragmentAlignment, 1.f), 0.f);
 
 Review comment:
   I would prefer an IllegalArgumentException here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361853561
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -167,68 +146,70 @@ public int previous() {
 
   @Override
   public int following(int matchEndIndex) {
-    final int targetIdx = (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment));
+    return following(matchEndIndex, (matchEndIndex + 1) + (int)(lengthGoal * (1.f - fragmentAlignment)));
+  }
+
+  private int following(int matchEndIndex, int targetIdx) {
     if (targetIdx >= getText().getEndIndex()) {
-      return baseIter.last();
+      if (currentCache == baseIter.last()) {
+        return DONE;
+      }
+      return currentCache = baseIter.last();
     }
     final int afterIdx = baseIter.following(targetIdx - 1);
     if (afterIdx == DONE) {
-      return baseIter.current();
+      currentCache = baseIter.last();
+      return DONE;
     }
     if (afterIdx == targetIdx) { // right on the money
-      return afterIdx;
+      return currentCache = afterIdx;
     }
     if (isMinimumLength) { // thus never undershoot
-      return afterIdx;
+      return currentCache = afterIdx;
     }
 
     // note: it is a shame that we invoke preceding() *one more time*; BI's are sometimes expensive.
 
     // Find closest break to target
     final int beforeIdx = baseIter.preceding(targetIdx);
     if (targetIdx - beforeIdx < afterIdx - targetIdx && beforeIdx > matchEndIndex) {
-      return beforeIdx;
+      return currentCache = beforeIdx;
     }
-    return afterIdx;
-  }
-
-  private int moveToBreak(int idx) { // precondition: idx is a known break
-    // bi.isBoundary(idx) has side-effect of moving the position.  Not obvious!
-    //boolean moved = baseIter.isBoundary(idx); // probably not particularly expensive
-    //assert moved && current() == idx;
-
-    // TODO fix: Would prefer to do "- 1" instead of "- 2" but CustomSeparatorBreakIterator has a bug.
-    int current = baseIter.following(idx - 2);
-    assert current == idx : "following() didn't move us to the expected index.";
-    return idx;
+    // moveToBreak is necessary for when getSummaryPassagesNoHighlight calls next and current() is used
 
 Review comment:
   No longer need to reference "moveToBreak"

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8`
   - Match-fragment is the fragment that contains the match according to the underlying BI:
     - WORD: `vacation` because it breaks on space and period.
     - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)-fragsize)` (which is `54`).
     - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.`
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=0.5`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`.
   - `hl.bs.type=SENTENCE&hl.fragsize=10&hl.fragalign=1.0`:
     - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361530659
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
 
 Review comment:
   Even if this was not necessary before, the current version with `lastPassageEnd` does need `max()`. Without it the passages could overlap.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361533868
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment);
 
 Review comment:
   On the contrary!
   
   This suggestion is a little different than what I've thought of, so I'll have to rework the whole thing, but it has a great quality: This would make the length of the match irrelevant for the GLBI logic. I'm not going to go through all the details, but the code can be a little more optimal and less... hacky.
   
   I should have read this before writing [my previous comment](https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545), now it is deprecated.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361472335
 
 

 ##########
 File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##########
 @@ -41,6 +43,31 @@
   //                      01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
 
+  public void testFragmentAlignmentConstructor() throws IOException {
+    BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+    // test fragmentAlignment validation
+    float[] valid_aligns = {0.f, 0.3333f, 0.5f, 0.99f, 1.f};
+    for (float alignment : valid_aligns) {
+      LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+    }
+    float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+    for (float alignment : invalid_aligns) {
+      try {
+        LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+        Assert.fail("Expected IllegalArgumentException for "+alignment);
+      } catch (IllegalArgumentException e) {
+        if (!e.getMessage().contains("fragmentAlignment")) {
+          throw e;
+        }
+      }
+    }
+    // test backwards compatibility constructors
+    String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
 
 Review comment:
   Impressively thorough

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361839121
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd));
 
 Review comment:
   > Without this small proposal, the length of this passage will be undersized
   
   That's incorrect. Such a fragment will be undersized either way. The current approach has the `fragsize` split up by `fragAlignRatio` statically. Even if there is not fulfilled expansion on the left, that won't be used on the right. We would only be moving the point where the `fragsize` on the left is truncated.
   
   BTW in the results these would-be-overlapping fragments get merged into a single snippet. So they'll become a bigger one instead of one normal- and one undersized. The only sure place we will receive an undersized snippet is when a match is at the very beginning or the end of the text.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361476978
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd));
+        lastPassageEnd = Math.min(this.breakIterator.following(start), contentLength);
 
 Review comment:
   What I'm about to propose is perhaps technically out of scope but can you please call `following` with the variable `end` minus 1?
   Do you think we might need to guard the passage end offset by ensuring it is > the start offset?  start might theoretically be before lastPassageEnd I think; if it wasn't possible then you wouldn't have guarded that with `max(...,lastPassageEnd)`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361538041
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd));
+        lastPassageEnd = Math.min(this.breakIterator.following(start), contentLength);
 
 Review comment:
   RE "end": I should have made it clear that I make the small request in awareness it will not have any value when this particular LengthGoalBreakIterator is used, at least as is written now.  But in principle it makes sense and who knows what BI a user might supply.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley closed pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley closed pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361828293
 
 

 ##########
 File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -146,18 +182,14 @@ public int following(int followingIdx) {
       return afterIdx;
     }
 
-    // note: it is a shame that we invoke preceding() *in addition to* following(); BI's are sometimes expensive.
-
-    // Find closest break < target
-    final int beforeIdx = baseIter.preceding(targetIdx); // or could do baseIter.previous() but we hope the BI implements preceding()
-    if (beforeIdx <= followingIdx) { // too far back
-      return moveToBreak(afterIdx);
-    }
+    // note: it is a shame that we invoke preceding() *one more time*; BI's are sometimes expensive.
 
-    if (targetIdx - beforeIdx <= afterIdx - targetIdx) {
+    // Find closest break to target
+    final int beforeIdx = baseIter.preceding(targetIdx);
+    if (targetIdx - beforeIdx < afterIdx - targetIdx && beforeIdx > matchEndIndex) {
       return beforeIdx;
     }
-    return moveToBreak(afterIdx);
 
 Review comment:
   Maybe removing the moveToBreak here is an optimization attempt but it feels wrong.  Also current() accuracy will become necessary if you look at my other comment about a small proposal to use it so this BI in preceding() doesn't get look too far left.
   
   It'd be nice if moveToBreak didn't need to exist.  I could imagine a BI wrapper that you could set it's position on but it wouldn't propagate.  Methods whose semantics are based on current() (like next()) would need to act differently.  Kind of a lot of code though.  I took a stab at it for fun: https://gist.github.com/dsmiley/b425b152d51d4c63f498fc84d125ea0a

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org