Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2019/12/13 21:15:00 UTC

[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

    [ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995916#comment-16995916 ] 

David Smiley commented on LUCENE-9093:
--------------------------------------

This is a very thoughtful response [~myusername8].  I'm really glad you are willing to contribute :-)

An idea I have thought of before is to try to get more leading context before the first word.  Basically, compute half the fragsize as the amount of leading text we'd like (the ratio could be configurable).  Then keep looping over sub-BreakIterator calls to preceding() until we reach this target.  Strictly speaking, a BreakIterator generically has no concept of a highlighting "match", but these special-purpose BreakIterators are used in the context of the UnifiedHighlighter and know that when preceding() is called, it's at the first match of a passage.  WDYT?  Unfortunately, I think it would yield Passages that overlap, and subsequent Passages would not contain the matches of the previous overlapping Passages. :-/  Maybe this could be overcome by FieldHighlighter detecting this and adding the pertinent matches from the most recent Passage.
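
To make the looping idea concrete, here is a rough sketch using only the JDK's java.text.BreakIterator.  The class name, the passageStart() helper, and the leadRatio parameter are all hypothetical and not part of Lucene; it also assumes the passage start is normally found with preceding(matchStart + 1).  It does nothing about the overlapping-Passage problem.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch, not Lucene code: walk backward over break boundaries
// until roughly fragSize * leadRatio characters of leading context are gathered.
final class LeadingContextSketch {

    static int passageStart(BreakIterator bi, int matchStart, int fragSize, double leadRatio) {
        int target = (int) (fragSize * leadRatio); // desired amount of leading text
        // Boundary at or before the match (assumed baseline behaviour).
        int start = Math.max(bi.preceding(matchStart + 1), 0);
        // Keep stepping to earlier boundaries until the target is reached
        // or the beginning of the text is hit.
        while (matchStart - start < target) {
            int prev = bi.preceding(start);
            if (prev == BreakIterator.DONE) {
                break;
            }
            start = prev;
        }
        return start;
    }

    public static void main(String[] args) {
        String text = "Audible, Apple Lossless, H.264 video";
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int matchStart = text.indexOf("Apple");
        int start = passageStart(bi, matchStart, 30, 0.5);
        // Print the leading context plus the match itself.
        System.out.println(text.substring(start, matchStart + "Apple".length()));
    }
}
```

With leadRatio set to 0 the loop never runs and the result is just the boundary at the match, so a match that starts on a word boundary gets no context to the left.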

I'm aware that the use of BreakIterator is limiting, constraining our solution space.  And it puts undue extra work on us to implement the JDK-defined abstraction.  Perhaps, like the FVH, the UH needs its own abstraction here.  CC [~romseygeek]

> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get context to the left of the matches returned; only words to the right of each match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is respecting the hl.fragsize parameter, although [SOLR-9935] suggests support was added.  I included the hl.fragsize param in the unified URL too, but it's making no difference unless set to 0.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
