Posted to issues@lucene.apache.org by "Christoph Goller (Jira)" <ji...@apache.org> on 2020/07/14 11:02:00 UTC

[jira] [Commented] (LUCENE-9426) UnifiedHighlighter does not handle SpanNotQuery correctly.

    [ https://issues.apache.org/jira/browse/LUCENE-9426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157281#comment-17157281 ] 

Christoph Goller commented on LUCENE-9426:
------------------------------------------

Analysis:

 

With PostingsOffsetStrategy, highlighting for SpanNotQuery works correctly.

 

With MemoryIndexOffsetStrategy, UnifiedHighlighter creates an in-memory index of the document that must be highlighted. However, it does not use the token stream produced by the indexAnalyzer directly. Instead, it applies a FilteringTokenFilter that throws away all tokens that do not occur in the query. I guess this is done for efficiency reasons. The filter is based on an automaton that is built by MultiTermHighlighting. MultiTermHighlighting is based on the Visitor concept, and it ignores all subqueries that have BooleanClause.Occur.MUST_NOT. While this may be correct for a Boolean NOT query, it is not correct for a SpanNotQuery. In the example above, we need the excluded token of the SpanNotQuery (content:thousand) in the in-memory index. Otherwise the query logic is corrupted: the exclude clause can never match, so every hit of the spanNear clause gets highlighted.

 

As a fix, I recommend adding all tokens from the query, even if they have BooleanClause.Occur.MUST_NOT. The index still remains small, but the query logic will be correct.
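
The effect of the filter can be illustrated with a small toy model. This is only a sketch and does not use any Lucene classes: the class SpanNotFilterSketch and its methods tokenize, filter and spanNot are hypothetical names invented for this illustration, and the sketch assumes that the filter drops tokens but leaves their positions unchanged. It evaluates a simplified spanNot(spanNear([100, dollars], 1, true), thousand, 0, 0) once on a token stream filtered to the positive terms only, and once on a stream that also keeps the excluded term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these names are Lucene APIs.
class SpanNotFilterSketch {

    // Lowercase and split on non-alphanumeric characters; array index = token position.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    // Drop tokens that are not in 'keep', leaving a null hole so positions do not shift
    // (assumption: the real filter also preserves positions).
    static String[] filter(String[] tokens, Set<String> keep) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = keep.contains(tokens[i]) ? tokens[i] : null;
        }
        return out;
    }

    // Simplified spanNot(spanNear([first, second], slop, inOrder=true), exclude, 0, 0):
    // returns {start, end} positions of ordered near-spans that contain no 'exclude' token.
    static List<int[]> spanNot(String[] tokens, String first, String second,
                               int slop, String exclude) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!first.equals(tokens[i])) continue;
            // j - i - 1 is the number of positions between the two terms.
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (!second.equals(tokens[j])) continue;
                boolean excluded = false;
                for (int k = i; k <= j; k++) {
                    if (exclude.equals(tokens[k])) excluded = true;
                }
                if (!excluded) hits.add(new int[] {i, j});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("We need 100 thousand dollars to buy the house");
        Set<String> positiveOnly = new HashSet<>(Arrays.asList("100", "dollars"));
        Set<String> allTerms = new HashSet<>(Arrays.asList("100", "dollars", "thousand"));

        // Excluded term filtered away: the span is wrongly reported as a match.
        System.out.println(
            spanNot(filter(tokens, positiveOnly), "100", "dollars", 1, "thousand").size()); // prints 1
        // Excluded term kept: the span is correctly rejected.
        System.out.println(
            spanNot(filter(tokens, allTerms), "100", "dollars", 1, "thousand").size()); // prints 0
    }
}
```

In this toy model, keeping the excluded term adds only one more entry to the set of retained tokens, which is in line with the expectation that the in-memory index stays small.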

> UnifiedHighlighter does not handle SpanNotQuery correctly.
> ----------------------------------------------------------
>
>                 Key: LUCENE-9426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9426
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5.1
>         Environment: I tested with 8.5.1, but other versions are probably also affected.
>            Reporter: Christoph Goller
>            Priority: Major
>              Labels: easyfix
>
> If UnifiedHighlighter uses MemoryIndexOffsetStrategy, it does not treat SpanNotQuery correctly.
> Since UnifiedHighlighter uses actual search in order to determine which locations to highlight, it should be consistent with search and only highlight locations in a document that really match the query. However, it does not for SpanNotQuery.
> For the query spanNot(spanNear([content:100, content:dollars], 1, true), content:thousand, 0, 0)
> it produces
> A <b>100</b> fucking <b>dollars</b> wasn't enough to fix it. ... We need <b>100</b> thousand <b>dollars</b> to buy the house


