Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2020/09/05 05:11:00 UTC
[jira] [Commented] (LUCENE-9461) Query hit highlighting components on top of matches API

    [ https://issues.apache.org/jira/browse/LUCENE-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190981#comment-17190981 ] 

David Smiley commented on LUCENE-9461:
--------------------------------------

Maybe not as a sub-task, but would it make sense to modify the UnifiedHighlighter to use some of these components, thereby reducing redundancy?  As I say this, I look at some of these new components and maybe not (yet)... but maybe I'll see it better once you get to the example task.

> Query hit highlighting components on top of matches API
> -------------------------------------------------------
>
>                 Key: LUCENE-9461
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9461
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: master (9.0)
>
>
> Highlighters. Eventually, you'll have to face them. 
> When a Lucene Query is run over an index, it implies a list of documents that "matched it" - literally a boolean indication of whether the document should be included in the search result or not. In practice, many applications need to convey to users not just the fact that a document matched the query but also some sort of intuitive explanation of *why* this particular query matched it. While in many cases the relationship is trivial (term containment), in the case of complex queries it may not be trivial at all (think of a really short prefix query, a fuzzy term query or even a Boolean disjunction with a high number of possibilities).
> Historically, search engines used to "highlight" the source area of a document that caused the "hit". If a document was too long, it was truncated and only the area around the hit (or hits) was displayed (a so-called "snippet").
> In my subjective opinion, in the Lucene API highlighters have played a secondary role to queries and search. And once you're trying to build something higher-level, highlighters are a crucial and necessary element of the entire system. 
> My experience (and user feedback) from an implementation of a document retrieval system where highlighting was involved was that it just didn't work as expected. Here are the requirements of that system:
> * the query parser uses default field expansion into multiple fields (there is no single "sink" field),
> * the highlights should match *exactly* what caused the hit; a search for 'title:foo' must not highlight foo in any other field,
> * the set of fields to be highlighted isn't really fixed - there are some fields that should always be displayed - title, summary - and others that should not be displayed unless they're part of the query (in which case the highlight is important and should be shown to the user),
> * highlights should be accurate for all sorts of queries: fuzzy, phrase, prefix, Boolean, spans, etc.,
> * there can be more than one query at one time and they should highlight the same content (with different colors).
> Many highlighters are available in Lucene (vector highlighter, postings highlighter, unified highlighter) but none of them quite fit the bill above. Believe me - we have tried (hard). We ended up using unified highlighter but with subclassing, customizations and all sorts of complex, low-level quirks. 
> My gut feeling at that point was that it should be the Query that somehow *exposes* the information about how a given field content matched. Then I looked at the matches API and built a quick prototype retrieving "match regions" on top of that. It works like magic. Here are the key insights:
> * matches API returns exactly what a highlighter needs: for a given query it iterates over fields and positions (including offsets, if they are available) that caused a document to be included in the search result,
> * when matches API cannot provide offsets, it provides elements from which offsets can be computed: for example, positions, which can be mapped to offsets by re-analyzing the field's value.
> * in extreme cases it may happen the matches API doesn't provide anything useful (a field only indexed, with no stored field value, no positions, no offsets) but I assume it is up to the application layer to know how to deal with this then (or not deal with it at all and throw an exception).
> * matches API delegates the work of providing proper match ranges to the query itself (actually, to the weight a query produces), it doesn't need to know anything about different implementations and their specifics.
> The absolute *key* element is the last one. Once you build a match region retriever, highlighting is merely about organizing match ranges, dealing with potential overlaps, and proper formatting. It becomes a simple, tractable problem separated from the internals of Lucene Queries.
> The initial set of "highlighter components" in this issue is a set of classes that allows one to assemble a complete pipeline from any query into a set of highlighted document fields. Any highlighter can be essentially built by assembling the following steps:
> * retrieving documents and their fields/match ranges, given [Query, IndexSearcher],
> * sanitizing match ranges (overlaps, etc.),
> * selecting the "best" snippet for the given set of match ranges,
> * formatting the output (adding start/end tags for snippets, ellipsis between values, etc.).
> This issue implements components for all of the above steps. It isn't about one highlighter class with tons of options, it's about bits and pieces that can be put together to build anything one desires. This said, an example "high level" highlighter class will also be provided as a sub-task.
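
To make the "sanitizing" and "formatting" steps from the description a bit more concrete, here is a minimal sketch in plain Java. It is not the code from this issue: the class MatchRangeSketch, the Range type and both method names are hypothetical, and it assumes that match ranges have already been retrieved as character offsets into a single field value. It merges overlapping (or touching) ranges and then wraps each merged range in start/end tags.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from LUCENE-9461 or the Lucene API. */
public class MatchRangeSketch {

  /** A half-open character range [start, end) within a field value. */
  public static final class Range {
    public final int start;
    public final int end;

    public Range(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Sorts ranges by start offset and merges those that overlap or touch. */
  public static List<Range> mergeOverlaps(List<Range> ranges) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((Range r) -> r.start));

    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      if (!merged.isEmpty() && r.start <= merged.get(merged.size() - 1).end) {
        // Extend the previous range instead of emitting a new one.
        Range last = merged.remove(merged.size() - 1);
        merged.add(new Range(last.start, Math.max(last.end, r.end)));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }

  /** Wraps each range in the given tags; ranges must be sorted and non-overlapping. */
  public static String format(String value, List<Range> ranges, String open, String close) {
    StringBuilder sb = new StringBuilder();
    int cursor = 0;
    for (Range r : ranges) {
      sb.append(value, cursor, r.start); // text before the match
      sb.append(open).append(value, r.start, r.end).append(close);
      cursor = r.end;
    }
    sb.append(value, cursor, value.length()); // trailing text after the last match
    return sb.toString();
  }

  public static void main(String[] args) {
    String value = "quick brown fox jumps";
    // Unsorted input with two overlapping ranges: "brown" and "own fox".
    List<Range> raw = List.of(new Range(6, 11), new Range(0, 5), new Range(8, 15));
    // prints: <b>quick</b> <b>brown fox</b> jumps
    System.out.println(format(value, mergeOverlaps(raw), "<b>", "</b>"));
  }
}
```

Keeping the merge step separate from the formatting step mirrors the component-based design argued for in the description: the formatter can assume sorted, non-overlapping input and stay trivial. Retrieving the ranges from a Query and selecting the "best" snippet are deliberately left out of this sketch.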



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
