Posted to issues@lucene.apache.org by "Zach Chen (Jira)" <ji...@apache.org> on 2021/03/10 04:59:00 UTC

[jira] [Commented] (LUCENE-9634) Highlighting of degenerate spans on fields *with offsets* doesn't work properly

    [ https://issues.apache.org/jira/browse/LUCENE-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298529#comment-17298529 ] 

Zach Chen commented on LUCENE-9634:
-----------------------------------

Hi [~dweiss], I took a look at this issue and am also not sure what the proper fix is. I'm considering a few possible solutions below, but I wonder whether there is a better one. I would therefore like to get your opinion before I proceed further (I can also open a PR for discussion if that's preferred).

For context, the root cause is as follows. Positions read in *OffsetsFromPositions#get* via *MatchesIterator#startPosition* and *MatchesIterator#endPosition* account for the *before* / *after* values properly, through *ExtendedIntervalIterator#start* and *ExtendedIntervalIterator#end* respectively. Offsets read in *OffsetsFromMatchIterator#get* via *MatchesIterator#startOffset* and *MatchesIterator#endOffset*, by contrast, are not adjusted by the *before* / *after* values at all, hence the incorrect highlighted offsets and the test failure in *TestMatchRegionRetriever#testDegenerateIntervalsWithOffsets*. The other OffsetsRetrievalStrategy implementations, such as *OffsetsFromTokens* and *OffsetsFromValues*, do not store or use the *before* / *after* values either, so I suspect they may have the same issue (I haven't tested them to confirm yet).
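To make the mismatch concrete, here is a toy model in plain Java. It does not use any Lucene classes: *Token*, *Range*, *tokenize*, *offsetsFromMatch* and *offsetsFromPositions* are hypothetical names invented for this sketch, and the whitespace tokenizer is an assumption standing in for a real analyzer. It only shows the idea that a range extended in position space and then converted to offsets differs from the offsets of the matched token itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; none of these types are Lucene classes.
class DegenerateSpanDemo {
    /** A token; its position is its index in the token list. */
    static final class Token {
        final int startOffset, endOffset;
        Token(int startOffset, int endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Half-open character range [start, end). */
    static final class Range {
        final int start, end;
        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on single spaces; an assumed stand-in for a real analyzer. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            while (i < n && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < n && text.charAt(i) != ' ') i++;
            if (i > start) tokens.add(new Token(start, i));
        }
        return tokens;
    }

    /** Offsets of the matched token itself; before/after are never consulted. */
    static Range offsetsFromMatch(List<Token> tokens, int matchPos) {
        Token t = tokens.get(matchPos);
        return new Range(t.startOffset, t.endOffset);
    }

    /** Extends the match in position space first, then converts to offsets. */
    static Range offsetsFromPositions(List<Token> tokens, int matchPos, int before, int after) {
        int startPos = Math.max(0, matchPos - before);
        int endPos = Math.min(tokens.size() - 1, matchPos + after);
        return new Range(tokens.get(startPos).startOffset, tokens.get(endPos).endOffset);
    }

    public static void main(String[] args) {
        String text = "foo bar baz qux";
        List<Token> tokens = tokenize(text);
        Range direct = offsetsFromMatch(tokens, 1);
        Range extended = offsetsFromPositions(tokens, 1, 1, 1);
        System.out.println(text.substring(direct.start, direct.end));     // bar
        System.out.println(text.substring(extended.start, extended.end)); // foo bar baz
    }
}
```

In this model the match on "bar" extended by one position on each side should highlight "foo bar baz", but the direct-offset path only ever returns the range of "bar".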

For the solution, I'm considering the following two options:
 # Deprecate *OffsetsFromMatchIterator* in favor of *OffsetsFromPositions*. The two appear to have similar implementations. Supporting adjustment by *before* / *after* values in *OffsetsFromMatchIterator* necessarily requires processing token position information as well, so the work involved might be the same as in *OffsetsFromPositions* whenever *before* / *after* are used. However, in "typical" scenarios where no *before* / *after* adjustment is needed, *OffsetsFromPositions* does more work than *OffsetsFromMatchIterator* because of the conversion from positions to offsets at the end.
 # Implement *OffsetsFromMatchIterator* similarly to *OffsetsFromTokens* and *OffsetsFromValues*, by explicitly analyzing the field and looping over the token stream again. This requires the *before* / *after* values to somehow become available in *OffsetsFromMatchIterator*, which may require some signature changes.

Another option is to create a new class similar to *ExtendedIntervalIterator* that handles the position adjustment inside *MatchesIterator#startOffset* and *MatchesIterator#endOffset*, using token stream processing internally. This option also appears to require changing quite a few signatures, though, so it may not be ideal.

What do you think about the solutions above?

> Highlighting of degenerate spans on fields *with offsets* doesn't work properly
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-9634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9634
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>
> Match highlighter works fine with degenerate interval positions when the {{OffsetsFromPositions}} strategy is used to compute offsets, but will show incorrect offset ranges if offsets are read directly from the {{MatchIterator}} ({{OffsetsFromMatchIterator}}).


