Posted to issues@lucene.apache.org by "Dawid Weiss (Jira)" <ji...@apache.org> on 2022/04/08 10:43:00 UTC

[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

    [ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519509#comment-17519509 ] 

Dawid Weiss commented on LUCENE-10229:
--------------------------------------

I finally found some time to take a closer look at what's happening. The reason extend does not work for highlighting is that, quite reasonably, it can only return the offsets delegated from the source interval. Once you shift left or right from the source interval's position, the offset information cannot be retrieved (because this would require a per-document, random-access position-to-offset map to be present somewhere).

That said, I don't think the extend source interval should _lie_ about knowing the offsets in the returned IntervalMatchesIterator - it should just return -1 for the start or end offset it does not know, as per the contract specified in MatchesIterator:
{code:java}
* The starting offset of the current match, or {@code -1} if offsets are not available
{code}
If we modify the implementation of the extend source interval to correctly pass the "offset unknown" information, we can then modify the matches-based highlighting to behave properly depending on whether the offsets are correct and known or not (in which case we can fall back to recomputing them from positions).
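
To make the fallback idea concrete, here is a minimal, self-contained sketch. It is not the code from the PR and does not use Lucene's classes: Match, Range, resolve and the token list are all hypothetical stand-ins. It assumes the highlighter can obtain per-position character ranges by re-analyzing the field value, and it treats -1 as "offset unknown", as in the MatchesIterator contract quoted above.

```java
import java.util.List;

public class OffsetFallbackSketch {
  /** Hypothetical stand-in for a single match; an offset of -1 means "unknown". */
  public static final class Match {
    public final int startPosition, endPosition, startOffset, endOffset;

    public Match(int startPosition, int endPosition, int startOffset, int endOffset) {
      this.startPosition = startPosition;
      this.endPosition = endPosition;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /** Character range [start, end) within the field value. */
  public static final class Range {
    public final int start, end;

    public Range(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Uses the reported offsets when they are known; otherwise recomputes the
   * missing side from positions. tokenRanges.get(p) is assumed to hold the
   * character range of the token at position p (e.g. from re-analysis).
   */
  public static Range resolve(Match m, List<Range> tokenRanges) {
    int start = m.startOffset >= 0 ? m.startOffset : tokenRanges.get(m.startPosition).start;
    int end = m.endOffset >= 0 ? m.endOffset : tokenRanges.get(m.endPosition).end;
    return new Range(start, end);
  }

  public static void main(String[] args) {
    // Field value "quick brown fox": one character range per token position.
    List<Range> tokens = List.of(new Range(0, 5), new Range(6, 11), new Range(12, 15));

    // "brown" extended by one position to the right: the start offset is
    // delegated from the source interval, the end offset is unknown.
    Match extended = new Match(1, 2, 6, -1);

    Range r = resolve(extended, tokens);
    System.out.println(r.start + ".." + r.end); // prints 6..15
  }
}
```

The point of the sketch is only that a consumer can tell "known" from "unknown" per side and pick the cheaper path when offsets are available; how the real highlighter recomputes offsets is up to the PR.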

This PR implements the above concepts (and passes all tests for me).

https://github.com/apache/lucene/pull/803

> Match offsets should be consistent for fields with positions and fields with offsets
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10229
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Major
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields with
> offsets don't highlight some more complex interval queries properly.  Alan says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of the outer.  And so if you're re-analyzing and retrieving offsets by looking at the positions, you get the 'right' thing.  It's not obvious to me what the correct response is here, but thinking about it the current behaviour is kind of the worst of both worlds, and perhaps we should change it so that you get offsets of the inner match as standard, and then the outer match is returned as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which restrict some other source of intervals. Here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend where they're highlighted is that filters are restrictions that should not be highlighted - it's the source intervals that people care about. Filters are what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight examples (on fields with positions) where this intuition is demonstrated on all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with positions and fields with offsets.


