Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2018/08/13 19:09:00 UTC

[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

    [ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578802#comment-16578802 ] 

David Smiley commented on LUCENE-8286:
--------------------------------------

Made substantial progress on the PR:
{noformat}
LUCENE-8286 UH: Use MI.getSubMatches().  Removed PhraseHelper changes; not necessary anymore.
Updated based on MI improvements in master.
With subMatches, we have better fidelity on span queries.
And since MI can handle span queries now, no need to touch PhraseHelper.
* added to UHComponents: query, and highlightFlags
* updated tests to handle with/without WEIGHT_MATCHES
* TestUnifiedHighlighterStrictPhrases uses more randomization.
  Removed brittle score calculation dependence.
* Test Passage matches data is in order
TODO: OE freq & term()
{noformat}
It was nice to see that UH's PhraseHelper can be circumvented now.  Handling mi.getSubMatches proved to be difficult, but I ultimately got it working.  See https://github.com/dsmiley/lucene-solr/blob/LUCENE-8286/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/OffsetsEnum.java#L168
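
As a rough illustration of the sub-match idea (a sketch only, not the code in the PR): when a match has sub-matches, report the per-term offsets instead of the whole span, which is what gives better fidelity on phrase and span queries. MatchCursor, ArrayCursor and SubMatchOffsets are hypothetical names made up for this example; MatchCursor is a simplified stand-in and not the Lucene MatchesIterator interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a match iterator; not the Lucene interface. */
interface MatchCursor {
    boolean next();              // advance to the next match; false when exhausted
    int startOffset();           // start character offset of the current match
    int endOffset();             // end character offset of the current match
    MatchCursor getSubMatches(); // per-term matches inside the current match, or null at a leaf
}

/** Array-backed cursor used only to exercise the sketch. */
final class ArrayCursor implements MatchCursor {
    private final int[][] spans;      // each entry is {startOffset, endOffset}
    private final MatchCursor[] subs; // parallel to spans; null entries are leaves
    private int index = -1;

    ArrayCursor(int[][] spans, MatchCursor[] subs) {
        this.spans = spans;
        this.subs = subs;
    }

    public boolean next() { return ++index < spans.length; }
    public int startOffset() { return spans[index][0]; }
    public int endOffset() { return spans[index][1]; }
    public MatchCursor getSubMatches() { return subs[index]; }
}

final class SubMatchOffsets {
    /** Collects {start, end} offsets, descending into sub-matches where they exist. */
    static List<int[]> collect(MatchCursor matches) {
        List<int[]> out = new ArrayList<>();
        while (matches.next()) {
            MatchCursor sub = matches.getSubMatches();
            if (sub == null) {
                // Leaf match: highlight it as-is.
                out.add(new int[] {matches.startOffset(), matches.endOffset()});
            } else {
                // Composite match (e.g. a sloppy phrase): highlight only the individual terms.
                out.addAll(collect(sub));
            }
        }
        return out;
    }
}
```

With a sloppy phrase covering offsets 4..19 whose terms sit at 4..9 and 16..19, this yields the two term ranges rather than one range that also covers the words in between.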

Next up is handling OffsetsEnum.getTerm().  I could change the API so that getTerm() returns getQuery() and consequently update Passage & PassageScorer.  Callers of getTerm() were all internal or considered experimental anyway (definitely not in common use), so I think it could change in a minor release.  I hope multi-term query types will be retained as such, but I fear MatchesIterator expands them before the original query can be retained, so the results here would be adequate rather than ideal.

Then, OffsetsEnum.freq().  This one is hard.  We could make "-1" mean "unsupported".  Then, a new PassageScorer design, created per highlighted field value, could be given access to the IndexReader in org.apache.lucene.search.uhighlight.FieldHighlighter#highlightOffsetsEnums.  When it sees -1 at scoring time, it could calculate the in-doc freq and cache it.  Or alternatively... maybe we don't care that much about the in-doc freq; it may be expensive to calculate anyway.  Maybe we want the associated Query's score for this document (which will consider global stats like IDF), but that again will need access to the IndexReader.  It'd be nice if boosts wrapped around the query could be considered, but that information just isn't available (also true without MI mode).
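
To make the "-1 plus cache" idea concrete, here is a minimal sketch under stated assumptions; it is not code from the PR. LazyFreqResolver and FREQ_UNSUPPORTED are hypothetical names, and the ToIntFunction is a stand-in for whatever IndexReader-backed lookup the scorer would really use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical sketch: resolves the in-doc freq lazily when the offsets source reports -1. */
final class LazyFreqResolver {
    /** Sentinel meaning "freq not supported by this offsets source" (an assumption of this sketch). */
    static final int FREQ_UNSUPPORTED = -1;

    // One resolver would be created per highlighted field value, so the cache is per document field.
    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> freqLookup; // stand-in for an IndexReader-backed lookup

    LazyFreqResolver(ToIntFunction<String> freqLookup) {
        this.freqLookup = freqLookup;
    }

    /** Returns reportedFreq unless it is the sentinel; then computes the freq once per term and caches it. */
    int resolve(String term, int reportedFreq) {
        if (reportedFreq != FREQ_UNSUPPORTED) {
            return reportedFreq;
        }
        return cache.computeIfAbsent(term, t -> freqLookup.applyAsInt(t));
    }
}
```

The potentially expensive lookup only runs for terms that actually reach scoring with -1, and at most once per term for a given field value.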

> UnifiedHighlighter should support the new Weight.matches API for better match accuracy
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8286
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new Weight.matches() API should allow the UnifiedHighlighter to highlight some BooleanQuery patterns more accurately -- see LUCENE-7903.
> In addition, this API should make the job of highlighting easier, reducing the LOC and related complexities, especially the UH's PhraseHelper.  Note: reducing/removing PhraseHelper is not a near-term goal since Weight.matches is experimental and incomplete, and perhaps we'll discover some gaps in flexibility/functionality.
> This issue should introduce a new UnifiedHighlighter.HighlightFlag enum option for this method of highlighting.   Perhaps call it {{WEIGHT_MATCHES}}?  Longer term it could go away and it'll be implied if you specify enum values for PHRASES & MULTI_TERM_QUERY?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
