Posted to issues@lucene.apache.org by "Richard Walker (Jira)" <ji...@apache.org> on 2019/10/21 02:45:00 UTC

[jira] [Commented] (SOLR-1954) Highlighter component should expose snippet character offsets and the score.

    [ https://issues.apache.org/jira/browse/SOLR-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955703#comment-16955703 ] 

Richard Walker commented on SOLR-1954:
--------------------------------------

Hey [~dsmiley], I'm very interested in this sort of thing. My particular need is something like this:
 * In my webapp's search results page I want to display the values of certain fields, which contain descriptive text that may be quite long.
 * If there's no highlighting info, display the first (say) 500 chars, followed by "...".
 * If there's highlighting info:
 ** If there's no highlight in the first (say) 300 chars, then display the first (say) 300 chars, then "...", then the first highlight snippet, then (if the snippet isn't at the end of the field!) another "..."
 ** Append (let's say) up to one more highlight snippet for the field.

Obviously I don't need Solr to do all of this work: i.e., I can do my own truncation of values returned in the response to 300/500/whatever characters; but I need Solr to give me the pieces to truncate and join together, and to tell me how.
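A minimal sketch of the client-side joining described above, assuming Solr could return each snippet as a (start, end) pair of character offsets into the stored field value. The function name `build_display`, its parameters, and the offset format are hypothetical and not part of Solr or of the attached patches. The case where a highlight does fall inside the first 300 chars is not spelled out above, so the sketch simply starts from the snippets in that case. Highlight markup (e.g. em tags) is left out.

```python
from typing import List, Optional, Tuple

ELLIPSIS = "..."


def build_display(value: str,
                  snippets: Optional[List[Tuple[int, int]]] = None,
                  plain_limit: int = 500,
                  lead_limit: int = 300,
                  max_snippets: int = 2) -> str:
    """Join a leading excerpt and highlight snippets for ONE field value.

    `snippets` holds hypothetical (start, end) character offsets into
    `value`, sorted by start; `end` is exclusive.
    """
    if not snippets:
        # No highlighting info: show the head of the value.
        if len(value) <= plain_limit:
            return value
        return value[:plain_limit] + ELLIPSIS

    parts = []
    pos = 0  # end offset of the text emitted so far
    if snippets[0][0] >= lead_limit:
        # No highlight in the lead: show the lead first.
        parts.append(value[:lead_limit])
        pos = lead_limit

    for start, end in snippets[:max_snippets]:
        if start > pos:
            # Text was skipped between the previous piece and this snippet.
            parts.append(ELLIPSIS)
        start = max(start, pos)  # do not repeat text already emitted
        if end > start:
            parts.append(value[start:end])
            pos = end

    if pos < len(value):
        # The last piece does not reach the end of the value.
        parts.append(ELLIPSIS)
    return "".join(parts)
```

With `max_snippets=2` this covers the first snippet plus "up to one more"; the 300/500 limits are just the example numbers from the list above.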

The position info generated by your patch would seem to help to do this. But maybe there is a better way. (It wasn't clear that I could tweak the existing hl parameters of the unified highlighter to get what I want.)

Oh, and one more snag: the fields are multiValued. Therefore I need to be able to distinguish the separate values of such fields. (I don't want to get "abc ... xyz", where "abc" and "xyz" come from _different_ values of the field.) In this case, position values that are just character indexes are, let's say, "inconvenient".
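To illustrate why flat character indexes are awkward here: the sketch below maps a flat offset back to a (value index, local offset) pair. It assumes, purely for illustration, that offsets count through the values in order with a fixed gap of `gap` characters between consecutive values. Whether the patch reports offsets that way is not something I have verified, and `locate` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import List, Tuple


def locate(values: List[str], offset: int, gap: int = 1) -> Tuple[int, int]:
    """Map a flat character offset to (value index, offset within that value).

    Assumes (hypothetically) that value i starts at the sum of the lengths
    of all earlier values plus `gap` characters per earlier value.
    """
    if not values or offset < 0:
        raise ValueError("offset out of range")

    # Flat start offset of each value.
    starts = []
    pos = 0
    for v in values:
        starts.append(pos)
        pos += len(v) + gap

    # Last value whose start is <= offset.
    idx = bisect_right(starts, offset) - 1
    local = offset - starts[idx]
    # local == len(value) is allowed so that exclusive end offsets map cleanly.
    if local > len(values[idx]):
        raise ValueError("offset falls between values or past the end")
    return idx, local
```

A snippet whose start and end map to different value indexes would span two values, which is exactly the "abc ... xyz" case to avoid.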

 

> Highlighter component should expose snippet character offsets and the score.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1954
>                 URL: https://issues.apache.org/jira/browse/SOLR-1954
>             Project: Solr
>          Issue Type: New Feature
>          Components: highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>         Attachments: SOLR-1954.patch, SOLR-1954_start_and_end_offsets.patch
>
>
> The Highlighter Component does not currently expose the snippet character offsets nor the score.  There is a TODO in DefaultSolrHighlighter indicating the intention to add this eventually.  This information is needed when doing highlighting on external content.  The data is there so it's pretty easy to output it in some way.  The challenge is deciding on the output and its ramifications on backwards compatibility.  The current highlighter component response structure doesn't lend itself to adding any new data, unfortunately.  I wish the original implementer had some foresight.  Unfortunately all the highlighting tests assume this structure.  Here is a snippet of the current response structure in Solr's sample data searching for "sdram" for reference:
> {code:xml}
> <lst name="highlighting">
>  <lst name="VS1GB400C3">
>   <arr name="text">
> 	<str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
>   </arr>
>  </lst>
> </lst>
> {code}
> Perhaps as a little hack, we introduce a pseudo field called text_startCharOffset, which is the concatenation of the matching field name and "_startCharOffset".  This would be an array of ints.  Likewise, there would be further arrays for endCharOffset and score.
> Thoughts?


