Posted to issues@lucene.apache.org by "Nándor Mátravölgyi (Jira)" <ji...@apache.org> on 2019/12/15 14:06:00 UTC
[jira] [Comment Edited] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

    [ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996736#comment-16996736 ] 

Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/15/19 2:05 PM:
----------------------------------------------------------------------

We have the same idea about how chained BreakIterators could be used to align the match in a more pleasing way. I also agree that some changes to FieldHighlighter will be necessary to handle overlaps. Your suggestion is to also highlight the matches that were already included in a previous Passage. I think trying to avoid overlaps completely is preferable: the snippets would not be redundant, and it would implicitly solve the problem of needing to highlight some matches more than once.

These are examples of what the least favorable edge cases would look like when we strictly avoid overlaps, but want to have centered match alignment. The search query is "field" and the original text is:
{noformat}
If set to false, or if there is no match in the alternate field either, the alternate field will be shown without highlighting, but could be marked by other processors.{noformat}
If the search has a fragsize of around 50, the first "field" will be centered properly. The next one will be left-aligned because the preceding text has already been used for a passage.
{noformat}
[
  "in the alternate <b>field</b> either, the alternate",
  "<b>field</b> will be shown without highlighting, but"
]{noformat}
If the search has a fragsize of around 60, the first "field" will be centered properly. The next one will be right-aligned because it sits at the very end of the passage made for the first match.
{noformat}
[
 "match in the alternate <b>field</b> either, the alternate <b>field</b>"
]{noformat}
Now the question is: which of these is closer to what we want to see? I'd say either "worst" edge case would be much better than the constantly left-aligned matches we have currently. Note: these are close to how the other highlighters behave when they have near-boundary matches.
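To make the idea concrete, here is a minimal sketch in Python of centering a passage on each match while strictly avoiding overlaps. It is not Lucene code: the names centered_passages and render are hypothetical, whitespace-delimited words stand in for a real BreakIterator, and the search term is assumed to be a single word. The snapping rule is only one possible choice, so the boundaries need not match the hand-written snippets above exactly.

```python
import re

TEXT = (
    "If set to false, or if there is no match in the alternate field either, "
    "the alternate field will be shown without highlighting, "
    "but could be marked by other processors."
)

def centered_passages(text, term, fragsize):
    """Return non-overlapping (start, end) passages, each centered on a match where possible."""
    # Assumption: whitespace-delimited words approximate the break iterator's boundaries.
    words = [m.span() for m in re.finditer(r"\S+", text)]
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    passages = []
    prev_end = 0
    for m in pattern.finditer(text):
        m_start, m_end = m.span()
        if m_end <= prev_end:
            continue  # this match is already inside the previous passage
        center = (m_start + m_end) // 2
        # Never reach back into text that an earlier passage has used.
        lo = max(prev_end, center - fragsize // 2)
        hi = lo + fragsize
        # Snap inward to whole words, but always keep the match itself.
        start = min((s for s, _ in words if lo <= s <= m_start), default=m_start)
        end = max((e for _, e in words if m_end <= e <= hi), default=m_end)
        passages.append((start, end))
        prev_end = end
    return passages

def render(text, term, passages):
    """Wrap every occurrence of the term inside each passage in <b> tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text[s:e])
            for s, e in passages]

for snippet in render(TEXT, "field", centered_passages(TEXT, "field", 50)):
    print(snippet)
# match in the alternate <b>field</b> either, the alternate
# <b>field</b> will be shown without highlighting, but
```

With fragsize 50 the second match ends up left-aligned, because everything before it already belongs to the first passage. Whether a larger fragsize pulls the second match into the first passage instead depends on how the end boundary is snapped; this sketch snaps inward, so it keeps passages within fragsize.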

Regarding the question of abstraction: I have not found a reason to think we need to replace the BreakIterators with a new interface. I think the bulk of the FastVectorHighlighter's fragment builder abstraction is about tracking the matches and highlighting the terms with different styles. (Note: I've only looked through it briefly.)

Just for the sake of completeness: for what I would like to do, a different concept of fragment length and snippet limit would be better. In all honesty, I want an excerpt of the document that shows valuable matches in the context of a few words around them, while the whole highlight is no longer than N characters. Right now I use fragsize=90 and snippets=3 because I want something that's not longer than 300 chars. If the highlighter could determine which differently sized fragments would yield the best excerpt, that would be the "best". A dense cluster of matches could form a 180-char fragment, while two singular matches would form two 50-char fragments. This could be better than forcing the fragments to be uniform in size.
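A rough sketch of what I mean, in Python rather than against the highlighter's API. The function budgeted_fragments and its context parameter are hypothetical, it works on raw character offsets only (no word-boundary snapping), and "most matches first" is just one assumed way of ranking fragments:

```python
def budgeted_fragments(match_spans, text_len, budget, context=20):
    """Merge nearby matches into clusters, pad each with context,
    and keep the clusters with the most matches that fit the budget."""
    clusters = []  # each entry: [start, end, match_count]
    for s, e in sorted(match_spans):
        # Matches whose padded windows would touch belong to one cluster.
        if clusters and s - clusters[-1][1] <= 2 * context:
            clusters[-1][1] = max(clusters[-1][1], e)
            clusters[-1][2] += 1
        else:
            clusters.append([s, e, 1])
    fragments = [(max(0, s - context), min(text_len, e + context), n)
                 for s, e, n in clusters]
    chosen, used = [], 0
    # Densest clusters first; the sort is stable, so ties keep document order.
    for start, end, n in sorted(fragments, key=lambda f: -f[2]):
        if used + (end - start) <= budget:
            chosen.append((start, end))
            used += end - start
    return sorted(chosen)

# Three close matches form one long fragment; isolated matches get short ones.
spans = [(10, 15), (30, 35), (60, 65), (300, 305), (600, 605)]
print(budgeted_fragments(spans, text_len=1000, budget=300))
# [(0, 85), (280, 325), (580, 625)]
```

Clusters are separated by more than twice the context, so the padded fragments cannot overlap, and the total length never exceeds the budget.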



> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get context to the left of the matches returned; only words to the right of each match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is respecting the hl.fragsize parameter, although [SOLR-9935] suggests support was added.  I included the hl.fragsize param in the unified URL too, but it's making no difference unless set to 0.)


