Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2018/01/06 15:59:01 UTC
[jira] [Updated] (LUCENE-8121) UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched positions

     [ https://issues.apache.org/jira/browse/LUCENE-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-8121:
---------------------------------
    Attachment: LUCENE-2287_UH_SpanCollector.patch

This was exciting to work on since it led to a lot of simplifications.  I would have built it this way from the start were it not for a requirement to support a version of Solr that didn't have the SpanCollector.

All changes are purely internal to this highlighter, with no API changes except minor ones to PhraseHelper and OffsetsEnum, which are public only for advanced uses.
* Replaced half of PhraseHelper with new code that uses the SpanCollector API, which is the aspect of this patch that fundamentally addresses the parent issue/bug.  It's far fewer LOC and it's simpler too (although complexity remains in the constructor, with its awkward relationship with WSTE).
* Instead of FieldOffsetStrategy.createOffsetsEnumsFromReader using PhraseHelper to help _filter_ PostingsEnums that it has already seeked, it now lets PhraseHelper handle the position-sensitive parts completely, collecting the underlying offsets into OffsetsEnums it creates.  This is simpler and probably faster as there's no double-traversal of PostingsEnums.
** I stole Luwak's SpanExtractor utility, putting its two methods onto PhraseHelper.  It's ASL licensed although copyrighted to Lemur.  [~romseygeek] can I incorporate this into Lucene without the copyright statement?
* Refactored OffsetsEnum to be an abstract class with several impls.  This addresses TODOs that make TokenStreamOffsetStrategy less hacky and make it easier in this patch to add a new type of OffsetsEnum.  I also removed hasMorePositions() and instead had nextPosition return a boolean -- simpler.
* Ported the test from LUCENE-5455
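
The iteration style described in the OffsetsEnum bullet above (nextPosition returning a boolean in place of a separate hasMorePositions()) can be sketched roughly as follows. This is a minimal illustration only: SketchOffsetsEnum, ArrayOffsetsEnum and OffsetsEnumDemo are hypothetical names and are not the classes in the patch, which carry more state (term, frequency, and so on).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the highlighter's OffsetsEnum; not the Lucene class.
abstract class SketchOffsetsEnum {
  /** Advances to the next position; returns false once the enum is exhausted. */
  abstract boolean nextPosition();

  /** Start offset of the current position; only valid after nextPosition() returned true. */
  abstract int startOffset();

  /** End offset of the current position; only valid after nextPosition() returned true. */
  abstract int endOffset();
}

// One possible implementation, backed by parallel arrays of already-collected offsets.
final class ArrayOffsetsEnum extends SketchOffsetsEnum {
  private final int[] starts;
  private final int[] ends;
  private int idx = -1; // before the first position

  ArrayOffsetsEnum(int[] starts, int[] ends) {
    if (starts.length != ends.length) {
      throw new IllegalArgumentException("starts and ends must have the same length");
    }
    this.starts = starts;
    this.ends = ends;
  }

  @Override
  boolean nextPosition() {
    if (idx + 1 >= starts.length) {
      return false; // nothing left; idx is not advanced
    }
    idx++;
    return true;
  }

  @Override
  int startOffset() {
    return starts[idx];
  }

  @Override
  int endOffset() {
    return ends[idx];
  }
}

final class OffsetsEnumDemo {
  /** Consumes the enum with a single loop condition, with no separate "has more" check. */
  static List<String> drain(SketchOffsetsEnum e) {
    List<String> out = new ArrayList<>();
    while (e.nextPosition()) {
      out.add(e.startOffset() + "-" + e.endOffset());
    }
    return out;
  }
}
```

With a boolean-returning nextPosition, callers need only one loop condition, and each implementation decides for itself how to detect exhaustion.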

I looked at some related code in Luwak.  I think I made two improvements in this patch versus Luwak.  Firstly ForceNoBulkScoringQuery isn't needed here since PhraseHelper directly accesses the weight & scorer.  Secondly SpanOffsetReportingQuery isn't needed since we can more easily wrap the underlying PostingsEnum (one place to wrap versus every SpanTermQuery).

I have a nocommit in PhraseHelper.OffsetSpanCollector.  It's using a bulky List<Integer> for the offsets.  I could convert it to an int[].  Or I might create some new Hit class (as Luwak does) to make it easier for advanced users to add new information to the hit (perhaps payloads)... but it's so internal that it'll be awkward to actually support such a use-case completely.  Also, I believe I need to ensure the collected information is de-duplicated and sorted (as Luwak does).  It would be good to have a test exercising this possibility.
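
As a rough illustration of the primitive-array option with sorting and de-duplication, here is a minimal sketch. It is not the patch's code: OffsetBuffer is a hypothetical name, and it packs each (start, end) pair into a long[] rather than using a plain int[], which is just one way to make the pairs sortable with Arrays.sort. It assumes offsets are non-negative.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the patch: collects offset pairs without boxing.
final class OffsetBuffer {
  private long[] packed = new long[8];
  private int size = 0;

  /** Records one (startOffset, endOffset) pair; both are assumed to be non-negative. */
  void add(int startOffset, int endOffset) {
    if (size == packed.length) {
      packed = Arrays.copyOf(packed, size * 2); // grow geometrically
    }
    // High 32 bits hold the start, low 32 bits hold the end.
    packed[size++] = ((long) startOffset << 32) | (endOffset & 0xFFFFFFFFL);
  }

  /** Returns {start, end} pairs sorted by start, then end, with exact duplicates removed. */
  int[][] sortedUnique() {
    long[] copy = Arrays.copyOf(packed, size);
    Arrays.sort(copy); // numeric order equals (start, end) order for non-negative offsets
    int[][] out = new int[size][];
    int n = 0;
    for (int i = 0; i < copy.length; i++) {
      if (i > 0 && copy[i] == copy[i - 1]) {
        continue; // skip a duplicate pair
      }
      out[n++] = new int[] {(int) (copy[i] >>> 32), (int) copy[i]};
    }
    return Arrays.copyOf(out, n);
  }
}
```

Adding (10,14), (0,5) and (10,14) again and then calling sortedUnique() yields the two pairs (0,5) and (10,14).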

There will be follow-on work for LUCENE-7903, which can leverage the progress here and perhaps use Luwak's SpanRewriter.

> UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched positions
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8121
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8121
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 7.3
>
>         Attachments: LUCENE-2287_UH_SpanCollector.patch
>
>
> The UnifiedHighlighter (and original Highlighter) highlight phrases by converting to a SpanQuery and using the Spans' start and end positions to assume that every occurrence of the underlying terms between those positions is to be highlighted.  But this is inaccurate; see LUCENE-5455 for a good example, and also LUCENE-2287.  The solution is to use the SpanCollector API which was introduced after the phrase matching aspects of those highlighters were developed.
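
The inaccuracy described in the issue summary above can be illustrated with a small self-contained sketch. Everything in it is hypothetical: LeafPositionCollector is a simplified callback loosely modelled on the idea of a span collector (it is not Lucene's SpanCollector interface), and the match data is written by hand rather than computed by Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified callback; NOT Lucene's SpanCollector interface.
interface LeafPositionCollector {
  void collectLeaf(int position, String term);
}

final class SpanHighlightSketch {

  /** Range heuristic: highlight every query-term occurrence with start <= position < end. */
  static List<Integer> highlightByRange(
      String[] tokens, Set<String> queryTerms, int start, int end) {
    List<Integer> positions = new ArrayList<>();
    for (int pos = start; pos < end; pos++) {
      if (queryTerms.contains(tokens[pos])) {
        positions.add(pos);
      }
    }
    return positions;
  }

  /** Collector approach: highlight only the leaf positions that the match itself reports. */
  static List<Integer> highlightByLeaves(int[] leafPositions, String[] tokens) {
    List<Integer> positions = new ArrayList<>();
    LeafPositionCollector collector = (position, term) -> positions.add(position);
    for (int pos : leafPositions) {
      collector.collectLeaf(pos, tokens[pos]); // hand-written match data, one call per leaf
    }
    return positions;
  }
}
```

Take the tokens "a b x b c" and a hand-written match for the phrase "a b" followed nearby by "c", spanning positions 0 to 5 (end exclusive) with leaves at positions 0, 1 and 4. The range heuristic returns positions 0, 1, 3 and 4, wrongly including the second "b", while the leaf-based approach returns only 0, 1 and 4.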


