Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2021/01/29 21:42:00 UTC
[jira] [Commented] (LUCENE-9712) UnifiedHighlighter, optimize WEIGHT_MATCHES when many fields

    [ https://issues.apache.org/jira/browse/LUCENE-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275373#comment-17275373 ] 

David Smiley commented on LUCENE-9712:
--------------------------------------

I suspect the culprit is in this method: {{org.apache.lucene.search.uhighlight.FieldOffsetStrategy#createOffsetsEnumsWeightMatcher}}

-- which computes the Matches from the Weight, but it is called once per field.  I suspect most of the cost is in re-computing this over and over again.  If we can assume there is no "getFieldMatcher" (i.e. assume {{hl.requireFieldMatch=true}}), and that the offset source is the actual index (no re-analysis), then the leafReader would be the same across fields, and thus the Matches would be re-usable.  But how to re-use it across fields?  There's no clear place nearby since this part of the code is very field-centric.  UHComponents is immutable; that could be changed to hold some Map.  Or, I was thinking maybe the Query could be wrapped with an impl whose Weight caches its Matches result for a given (leafReader, docId) pair. Hmmmm.
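
To make the caching idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class name {{LastDocMatchesCache}} and the type parameters {{L}} and {{M}} are hypothetical stand-ins for LeafReaderContext and Matches, and the delegate function stands in for the real {{Weight#matches(LeafReaderContext, int)}} call. It assumes fields of one doc are highlighted back to back, so remembering only the last (leaf, docId) pair is enough.

```java
import java.util.function.BiFunction;

/**
 * Remembers the last result computed for a (leaf, docId) pair.
 * L and M are stand-ins for Lucene's LeafReaderContext and Matches;
 * this class is illustrative and not part of Lucene.
 */
final class LastDocMatchesCache<L, M> {
    private final BiFunction<L, Integer, M> delegate; // the expensive computation
    private L lastLeaf;
    private int lastDoc = -1;
    private M lastResult;
    private boolean primed;   // distinguishes "nothing cached" from a cached null (no match)
    private int computeCount; // only here to make the saving observable

    LastDocMatchesCache(BiFunction<L, Integer, M> delegate) {
        this.delegate = delegate;
    }

    M matches(L leaf, int doc) {
        // Identity comparison on the leaf: the result is only reusable when
        // every field is highlighted against the very same reader.
        if (primed && leaf == lastLeaf && doc == lastDoc) {
            return lastResult;
        }
        lastResult = delegate.apply(leaf, doc);
        lastLeaf = leaf;
        lastDoc = doc;
        primed = true;
        computeCount++;
        return lastResult;
    }

    int computeCount() {
        return computeCount;
    }
}
```

With something like this behind a wrapping Weight, highlighting N fields of the same doc would hit the delegate once instead of N times, as long as the leaf is shared. With re-analysis, each field would presumably get its own reader, so the identity check would fail and nothing would be saved.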

This kind of highlights a structural challenge in the UH: it is very field-centric, and thus it's not clear where to share info across fields of the same doc.  Above I qualified some ideas that would only work for an index-based offset source (in postings), but it'd suck not to handle re-analysis, which is popular.  Again, if there were a more document-centric approach, then the underlying MemoryIndex could be built across fields, which would then enable re-use of Matches since MI's leafReader would be the same.

> UnifiedHighlighter, optimize WEIGHT_MATCHES when many fields
> ------------------------------------------------------------
>
>                 Key: LUCENE-9712
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9712
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Priority: Major
>
> A user reported that highlighting many fields per document in WEIGHT_MATCHES mode is quite slow: [https://lists.apache.org/thread.html/r152c74a884b5ff72f3d530fc452bb0865cc7f24ca35ccf7d1d1e4952%40%3Csolr-user.lucene.apache.org%3E]
> The query is a DisjunctionMax over many fields – basically the ones being highlighted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
