Posted to issues@lucene.apache.org by "Zach Chen (Jira)" <ji...@apache.org> on 2021/03/10 04:59:00 UTC
[jira] [Commented] (LUCENE-9634) Highlighting of degenerate spans on fields *with offsets* doesn't work properly
[ https://issues.apache.org/jira/browse/LUCENE-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298529#comment-17298529 ]
Zach Chen commented on LUCENE-9634:
-----------------------------------
Hi [~dweiss], I took a look at this issue and I'm also not sure what the proper way to fix it is. I'm considering a few possible solutions below, but I wonder whether there is a better one. I would therefore like to get your opinion before I proceed further (I can also open a PR for discussion if that's preferred).
For context, the root cause is that start and end offsets are never adjusted for the *before* / *after* values. Positions read in *OffsetsFromPositions#get* through *MatchesIterator#startPosition* and *MatchesIterator#endPosition* account for *before* / *after* properly via *ExtendedIntervalIterator#start* and *ExtendedIntervalIterator#end*, respectively. Offsets read in *OffsetsFromMatchIterator#get* through *MatchesIterator#startOffset* and *MatchesIterator#endOffset* receive no such adjustment, hence the incorrect highlighted offsets and the failure of *TestMatchRegionRetriever#testDegenerateIntervalsWithOffsets*. The other OffsetsRetrievalStrategy implementations, such as *OffsetsFromTokens* and *OffsetsFromValues*, do not store or use *before* / *after* values either, so I suspect they may have the same issue (I haven't tested them to confirm yet).
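To make the mismatch concrete, here is a minimal toy model in plain Java. It does not use any Lucene classes: the class name, the {{int[][]}} token layout and both method names are hypothetical, and the widening arithmetic is only my assumption of what "accounting for *before* / *after*" amounts to. It is a sketch of the symptom, not the actual implementation.

```java
// Toy model only: no Lucene types are used; all names here are hypothetical.
final class DegenerateOffsetsSketch {

    // tokens[pos] = {startOffset, endOffset} of the token at position pos.
    // Position-based path: widen the match by before/after positions first,
    // then convert the widened positions to character offsets.
    static int[] viaPositions(int[][] tokens, int start, int end, int before, int after) {
        int s = Math.max(0, start - before);
        // long arithmetic avoids overflow when 'after' is very large
        int e = (int) Math.min((long) tokens.length - 1, (long) end + after);
        return new int[] {tokens[s][0], tokens[e][1]};
    }

    // Offset-based path: the raw offsets of the match are returned as-is,
    // so before/after never influence the highlighted range.
    static int[] viaRawOffsets(int[][] tokens, int start, int end) {
        return new int[] {tokens[start][0], tokens[end][1]};
    }

    public static void main(String[] args) {
        // Offsets for the text "foo bar baz qux"
        int[][] tokens = {{0, 3}, {4, 7}, {8, 11}, {12, 15}};
        int[] widened = viaPositions(tokens, 1, 1, 1, 1);
        int[] raw = viaRawOffsets(tokens, 1, 1);
        System.out.println(widened[0] + "-" + widened[1]);
        System.out.println(raw[0] + "-" + raw[1]);
    }
}
```

In this model a match on "bar" with before=1 and after=1 covers "foo bar baz" on the position-based path, while the offset-based path still reports only "bar".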
For the solution to this, I'm considering the following two options:
# Deprecate *OffsetsFromMatchIterator* in favor of *OffsetsFromPositions*. These two appear to have similar implementations, and since supporting position adjustment with *before* / *after* values in *OffsetsFromMatchIterator* necessarily requires processing token position information as well, the processing work involved might be the same as in *OffsetsFromPositions* when *before* / *after* are used. However, in "typical" scenarios where no *before* / *after* adjustment is needed, *OffsetsFromPositions* does more work than *OffsetsFromMatchIterator* because of the position-to-offset conversion at the end.
# Implement *OffsetsFromMatchIterator* similarly to *OffsetsFromTokens* and *OffsetsFromValues*, by explicitly analyzing and looping over the token stream again. This requires the *before* / *after* values to somehow become available in *OffsetsFromMatchIterator*, which may require a signature change.
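As a rough illustration of the second option, the sketch below shows the kind of work a second pass over the token stream would have to do. It is a self-contained toy in plain Java, not Lucene code: the class name, the {{adjust}} method and the {{int[][]}} token layout are hypothetical, and it simply assumes the *before* / *after* values have already been made available to the caller.

```java
// Toy model only: no Lucene types are used; all names here are hypothetical.
final class OffsetAdjustmentSketch {

    // tokens[pos] = {startOffset, endOffset}, as a second analysis pass might yield.
    // Maps the raw match offsets back to token positions, widens by before/after,
    // and converts the widened positions back to character offsets.
    static int[] adjust(int[][] tokens, int startOffset, int endOffset, int before, int after) {
        int first = -1;
        int last = -1;
        for (int pos = 0; pos < tokens.length; pos++) {
            boolean overlaps = tokens[pos][1] > startOffset && tokens[pos][0] < endOffset;
            if (overlaps) {
                if (first < 0) {
                    first = pos;
                }
                last = pos;
            }
        }
        if (first < 0) {
            // No token overlaps the raw range: return it unchanged.
            return new int[] {startOffset, endOffset};
        }
        int s = Math.max(0, first - before);
        // long arithmetic avoids overflow when 'after' is very large
        int e = (int) Math.min((long) tokens.length - 1, (long) last + after);
        return new int[] {tokens[s][0], tokens[e][1]};
    }
}
```

Even in this simplified form, the adjustment needs a full scan over token positions, which is why the work may end up comparable to what *OffsetsFromPositions* already does when *before* / *after* are used.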
Another option is to create a new class similar to *ExtendedIntervalIterator* that handles position adjustment internally within *MatchesIterator#startOffset* and *MatchesIterator#endOffset* through token stream processing. This option also appears to require changing quite a few signatures, so it may not be ideal.
What do you think about the solutions above?
> Highlighting of degenerate spans on fields *with offsets* doesn't work properly
> -------------------------------------------------------------------------------
>
> Key: LUCENE-9634
> URL: https://issues.apache.org/jira/browse/LUCENE-9634
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
>
> Match highlighter works fine with degenerate interval positions when the {{OffsetsFromPositions}} strategy is used to compute offsets, but will show incorrect offset ranges if offsets are read directly from the {{MatchIterator}} ({{OffsetsFromMatchIterator}}).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)