Posted to dev@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2019/09/10 21:13:00 UTC
[jira] [Updated] (SOLR-1954) Highlighter component should expose snippet character offsets and the score.

     [ https://issues.apache.org/jira/browse/SOLR-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated SOLR-1954:
-------------------------------
    Attachment: SOLR-1954.patch
      Assignee: David Smiley
        Status: Open  (was: Open)

At the Lucene/Solr Hackday event, I worked on this for the Unified highlighter.  I'm attaching a patch that is very much WIP but basically works.  It adds a boolean "hl.extended" flag; when set, the response is a structured, detailed one in place of the list of snippets.  TODOs:
* Expose more info; so far I have only exposed a couple of things.
* Probably make the format nicer.  The code definitely has rough edges; there are TODOs and WIP bits that still need tidying up.
* SolrJ QueryResponse
* Ref guide

> Highlighter component should expose snippet character offsets and the score.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1954
>                 URL: https://issues.apache.org/jira/browse/SOLR-1954
>             Project: Solr
>          Issue Type: New Feature
>          Components: highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>         Attachments: SOLR-1954.patch, SOLR-1954_start_and_end_offsets.patch
>
>
> The Highlighter Component does not currently expose the snippet character offsets or the score.  There is a TODO in DefaultSolrHighlighter indicating the intention to add this eventually.  This information is needed when highlighting external content.  The data is already there, so it's pretty easy to output it in some way.  The challenge is deciding on the output format and its ramifications for backwards compatibility.  Unfortunately, the current highlighter component response structure doesn't lend itself to adding any new data; I wish the original implementer had shown some foresight.  To make matters worse, all the highlighting tests assume this structure.  For reference, here is a snippet of the current response structure when searching Solr's sample data for "sdram":
> {code:xml}
> <lst name="highlighting">
>  <lst name="VS1GB400C3">
>   <arr name="text">
> 	<str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
>   </arr>
>  </lst>
> </lst>
> {code}
> Perhaps, as a little hack, we could introduce a pseudo-field called text_startCharOffset, whose name is the concatenation of the matching field's name and "_startCharOffset".  This would be an array of ints.  Likewise, there would be further arrays for endCharOffset and score.
> Thoughts?
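
A minimal sketch of the pseudo-field idea described above, to make the proposed shape concrete. It is not taken from either attached patch and uses no Solr classes: PseudoFieldSketch, Snippet and highlightEntries are hypothetical names, the "_endCharOffset" and "_score" suffixes are assumed by analogy with "_startCharOffset", and the offsets and score in main are made-up illustrative values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative only and are not Solr
// classes or anything taken from the attached patches.
class PseudoFieldSketch {

    /** One highlighted snippet plus the metadata the issue asks to expose. */
    static final class Snippet {
        final String text;
        final int startCharOffset;
        final int endCharOffset;
        final float score;

        Snippet(String text, int startCharOffset, int endCharOffset, float score) {
            this.text = text;
            this.startCharOffset = startCharOffset;
            this.endCharOffset = endCharOffset;
            this.score = score;
        }
    }

    /**
     * Builds the per-document entries for one field: the existing snippet
     * array under the field name, plus parallel arrays under pseudo-field
     * names formed by appending a suffix to the field name.
     */
    static Map<String, List<Object>> highlightEntries(String field, List<Snippet> snippets) {
        List<Object> texts = new ArrayList<>();
        List<Object> starts = new ArrayList<>();
        List<Object> ends = new ArrayList<>();
        List<Object> scores = new ArrayList<>();
        for (Snippet s : snippets) {
            texts.add(s.text);
            starts.add(s.startCharOffset);
            ends.add(s.endCharOffset);
            scores.add(s.score);
        }
        // LinkedHashMap keeps the existing array first, then the new ones.
        Map<String, List<Object>> out = new LinkedHashMap<>();
        out.put(field, texts);                       // unchanged existing entry
        out.put(field + "_startCharOffset", starts); // suffix from the proposal
        out.put(field + "_endCharOffset", ends);     // assumed suffix
        out.put(field + "_score", scores);           // assumed suffix
        return out;
    }

    public static void main(String[] args) {
        String stored = "CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered "
                + "DDR 400 (PC 3200) System Memory - Retail";
        String highlighted = stored.replace("SDRAM", "<em>SDRAM</em>");
        // Made-up values: the snippet is assumed to span the whole stored value.
        Snippet s = new Snippet(highlighted, 0, stored.length(), 1.0f);
        Map<String, List<Object>> entries = highlightEntries("text", List.of(s));
        entries.forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

In this shape the existing "text" array is untouched, so clients and tests that only read it would keep working, while index i of each new array describes the i-th snippet. The cost is that a real field whose name happens to end in one of these suffixes would collide with a pseudo-field.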



