Posted to issues@lucene.apache.org by "Nándor Mátravölgyi (Jira)" <ji...@apache.org> on 2019/12/13 16:09:00 UTC
[jira] [Commented] (SOLR-11516) Unified highlighter with word separator never gives context to the left

    [ https://issues.apache.org/jira/browse/SOLR-11516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995738#comment-16995738 ] 

Nándor Mátravölgyi commented on SOLR-11516:
-------------------------------------------

As previously stated, the UnifiedHighlighter always returns full sentences (with hl.bs.type=SENTENCE), effectively ignoring the fragsize parameter. Changing the breakiterator type to WORD makes fragsize work as expected, but the matches are not "centered" in the snippets, which makes their context much less apparent in some cases.

Trimming on the client side is a relatively bad solution in my opinion. Let's say I receive a highlight that is several times longer than I want, and the matches are very unevenly distributed because of a long sentence: OXOOOOOOOOOXXX (X represents the matches and O the text around them). To be truly correct, the client has to parse the highlight for the pre and post tags and strip out the middle of the text. A more primitive solution would bluntly truncate the highlight, and the valuable matches at the end would be lost to the client. This work is redundant and wasteful if Solr could do it the way the other highlighters do.

I really wanted to have the fragsize be around what I specified even with much longer sentences, so I've spent some time analyzing the code and designing possible changes.

The UnifiedHighlighter chains the breakiterator instance selected by the hl.bs.type parameter with a LengthGoalBreakIterator (unless fragsize <= 1 or type == WHOLE). It is actually the LengthGoalBreakIterator that decides which parts of the text around the actual match go into the snippet.

Currently this class always starts the snippet from the first break before the match indicated by the wrapped iterator, and may only extend the snippet beyond the match until fragsize is reached.
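
To make the behavior described above concrete, here is a minimal standalone sketch. It is not the Lucene code: it uses the JDK's java.text.BreakIterator, and the class name, the helper methods and the sample text (modeled on the snippet from the issue description) are assumptions made up for illustration. The snippet starts at the break at or before the match and only grows to the right until fragsize is reached.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class LeftAnchoredSnippetSketch {

    /** Returns offset if it is a break, otherwise the closest break before it. */
    static int breakAtOrBefore(BreakIterator breaks, int offset) {
        if (breaks.isBoundary(offset)) {
            return offset;
        }
        int previous = breaks.preceding(offset);
        return previous == BreakIterator.DONE ? 0 : previous;
    }

    /** Returns offset if it is a break, otherwise the closest break after it. */
    static int breakAtOrAfter(BreakIterator breaks, int offset, int textLength) {
        if (breaks.isBoundary(offset)) {
            return offset;
        }
        int next = breaks.following(offset);
        return next == BreakIterator.DONE ? textLength : next;
    }

    /** Left-anchored snippet: no context is ever taken from before the match. */
    static String snippet(String text, int matchStart, int matchEnd, int fragsize) {
        BreakIterator breaks = BreakIterator.getWordInstance(Locale.ROOT);
        breaks.setText(text);
        // The snippet begins at the break right before (or at) the match.
        int start = breakAtOrBefore(breaks, matchStart);
        // It extends to the right until roughly fragsize characters are covered.
        int target = Math.min(text.length(), Math.max(matchEnd, start + fragsize));
        int end = breakAtOrAfter(breaks, target, text.length());
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the techproducts data verbatim.
        String text = "MP3, AAC, Audible, Apple Lossless, H.264 video";
        int matchStart = text.indexOf("Apple");
        int matchEnd = matchStart + "Apple".length();
        System.out.println(snippet(text, matchStart, matchEnd, 30));
    }
}
```

With this sample the snippet begins with the match itself, which is the "anchored to the left" effect reported in the issue.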

There is a "closestTo" mode implemented in it, but it always starts the snippet the same way as the mode currently in use, and it is not selectable because that would require an additional parameter that does not exist yet. ([view on github|https://github.com/apache/lucene-solr/blob/e5df183a42967c0eb79b5c2c65cd3ab618318f23/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330])

So far I can see two ways to improve this:
 # Improve the LengthGoalBreakIterator to have a "centerAround" mode. This has the benefit of working with all other hl.bs.types, even though it would mostly be meaningful for SEPARATOR and WORD. In SENTENCE mode a large enough fragsize could include a preceding sentence in the snippet as well. Using this mode requires a new parameter, maybe something like "hl.bs.snippetAlignment", which could take the values "min" (current behavior), "closest" (currently unreachable) and "center" (the proposed behavior).
 # Make a new hl.bs.type, maybe AROUND_MATCH, and create a different breakiterator wrapper to be used instead of the LengthGoalBreakIterator. This would wrap a WORD breakiterator, thus producing results similar to the other highlighters.
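
A rough sketch of what "centering" could mean, under stated assumptions: this is not a patch against LengthGoalBreakIterator, it uses the JDK's java.text.BreakIterator on its own, and the class name, helpers and sample text are hypothetical. The idea is to give about half of the leftover fragsize budget to the left of the match, snap both ends to word breaks, and never cut into the match.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CenteredSnippetSketch {

    /** Returns offset if it is a break, otherwise the closest break before it. */
    static int breakAtOrBefore(BreakIterator breaks, int offset) {
        if (breaks.isBoundary(offset)) {
            return offset;
        }
        int previous = breaks.preceding(offset);
        return previous == BreakIterator.DONE ? 0 : previous;
    }

    /** Returns offset if it is a break, otherwise the closest break after it. */
    static int breakAtOrAfter(BreakIterator breaks, int offset, int textLength) {
        if (breaks.isBoundary(offset)) {
            return offset;
        }
        int next = breaks.following(offset);
        return next == BreakIterator.DONE ? textLength : next;
    }

    /** Snippet with context on both sides of the match, about fragsize long. */
    static String snippet(String text, int matchStart, int matchEnd, int fragsize) {
        BreakIterator breaks = BreakIterator.getWordInstance(Locale.ROOT);
        breaks.setText(text);
        // Characters of the budget that are left over once the match is covered.
        int slack = Math.max(0, fragsize - (matchEnd - matchStart));
        // Half of the leftover budget goes to the left of the match.
        int desiredStart = Math.max(0, matchStart - slack / 2);
        // Snap inward to a break, but never start after the match begins.
        int start = Math.min(matchStart, breakAtOrAfter(breaks, desiredStart, text.length()));
        // The rest of the budget goes to the right; snap inward again.
        int desiredEnd = Math.min(text.length(), Math.max(matchEnd, start + fragsize));
        int end = Math.max(matchEnd, breakAtOrBefore(breaks, desiredEnd));
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the techproducts data verbatim.
        String text = "MP3, AAC, Audible, Apple Lossless, H.264 video";
        int matchStart = text.indexOf("Apple");
        int matchEnd = matchStart + "Apple".length();
        System.out.println(snippet(text, matchStart, matchEnd, 30));
    }
}
```

Snapping inward keeps the snippet within fragsize unless the match itself is longer. Snapping outward instead would be an equally valid choice if fragsize is meant as a minimum rather than a maximum.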

One question is whether the passage (ultimately snippet) extraction algorithm in FieldHighlighter needs to change. Currently, because no breakiterator looks before the match for a passage start position, the passages are guaranteed not to overlap. That would no longer be the case after the changes, so this may also need some work. (Interestingly, the FastVector highlighter can produce slight overlaps if the matches are dense enough, while the original will not.)
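
One possible way to deal with overlaps, purely as an illustration: treat each passage as a [start, end) offset pair and merge the ones that overlap. This is an assumption about how it could be handled, not how FieldHighlighter works today. The class and method names are hypothetical, and a real change would also have to combine passage scores and match offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class PassageOverlapSketch {

    /**
     * Merges [start, end) offset pairs that overlap. Pairs that merely touch
     * (end of one equals start of the next) are kept separate.
     */
    static List<int[]> mergeOverlapping(List<int[]> passages) {
        List<int[]> sorted = new ArrayList<>(passages);
        sorted.sort(Comparator.comparingInt((int[] p) -> p[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] passage : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && passage[0] < last[1]) {
                // Overlap: extend the previous passage instead of adding a new one.
                last[1] = Math.max(last[1], passage[1]);
            } else {
                // Copy so the caller's arrays are never modified.
                merged.add(new int[] {passage[0], passage[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical offsets: the first two passages overlap, the third does not.
        List<int[]> passages = Arrays.asList(
                new int[] {8, 38}, new int[] {30, 60}, new int[] {70, 90});
        for (int[] passage : mergeOverlapping(passages)) {
            System.out.println(Arrays.toString(passage));
        }
    }
}
```

Merging trades snippet count for length, so a merged passage can exceed fragsize. Shrinking the later passage so that it starts where the earlier one ends would be the alternative.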

I'm pretty sure either can be done with minimal overhead since all data is already available. The algorithms just need to make different decisions about where to slice the strings. I'm willing to work on this, so please share your ideas.

> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: SOLR-11516
>                 URL: https://issues.apache.org/jira/browse/SOLR-11516
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 6.4, 7.1
>            Reporter: Tim Retout
>            Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get context to the left of the matches returned; only words to the right of each match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is respecting the hl.fragsize parameter, although [SOLR-9935] suggests support was added.  I included the hl.fragsize param in the unified URL too, but it's making no difference unless set to 0.)


