Posted to solr-user@lucene.apache.org by Juan Fernando Mora <im...@gmail.com> on 2016/10/11 06:15:14 UTC

Highlight partial match

Hi,

I have been doing some research on highlighting partial matches. There is
some information on Google, but it is far from complete, and I just can't get it
to work.

*I have highlighting working but it highlights complete words, example:*


http://localhost:8983/solr/pcsearch/select?indent=on&q=comput&wt=json

*will produce:*

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"comput",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "_text_":["Some Fancy Description and Features"],
        "id":"9332346275143077",
        "name":["Fancy Computer"],
        "type":"product",
        "_version_":1547871793506680832}]
  },
  "highlighting":{
    "9332346275143077":{
      "name":["Fancy <em>Computer</em>"]}}}


*What I'm trying to get is something like "name":["Fancy <em>Comput</em>er"],
with only the partial match highlighted.*

*I have tried a few things but I keep missing something. My requestHandler
in solrconfig.xml looks like this:*

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="hl">on</str>
      <str name="hl.fl">name,city,state</str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.snippets">100</str>
      <str name="hl.usePhraseHighlighter">true</str>
      <str name="hl.fragsize">100</str>
    </lst>
</requestHandler>


*The name field in schema.xml:*

<field name="name" type="text_basic" indexed="true" stored="true"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true"/>


*And finally where I'm almost sure the error lies:*

<fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateAll="1"
              preserveOriginal="1" generateWordParts="1"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="25"/>
    </analyzer>

    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="25"/>
    </analyzer>
</fieldType>

*I don't need stemming for my final application.*

Any help or pointer in the right direction you can provide will be
appreciated.

Thanks!

--Juan Fernando

Re: Highlight partial match

Posted by Juan Fernando Mora <jf...@juanfernando.com.mx>.
Well, that would explain it,

I hadn't noticed the start and end values. I'm not experienced with
analysis, but this is really interesting and I will look into it.

Thanks a lot Shawn!

On Tue, Oct 11, 2016 at 7:31 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/11/2016 12:15 AM, Juan Fernando Mora wrote:
> > Hi, I have been doing some research on highlighting partial matches,
> > there are some information on google but is far from complete and I
> > just can't get it to work. *I have highlighting working but it
> > highlights complete words, example:*
>
> I have no experience with highlighting, but I think the reason that this
> happens is because of how the Lucene index (specifically in this case,
> the EdgeNGram filter) stores information in the index.  I put your
> fieldType into a 6.2 example index and did index analysis on
> "computer".  This is the result:
>
> https://www.dropbox.com/s/ph524b8ij1hk28o/solr-analysis-computer-edgengrams.png?dl=0
>
> Notice how every term has a start value of "0" and an end value of "8"
> ... this is the character position inside the original indexed text.
> Every term resolves to the original source text of "computer".
>
> I believe these start/end values in the index are how highlighting
> decides *what* to highlight, though I admit I could have a flawed
> understanding of how it works.  If my understanding is correct, then
> obtaining what you want would involve an alternate NGram filter that
> writes different start/end values.  I don't think that an alternative
> like this to EdgeNGram exists.
>
> Thanks,
> Shawn
>
>

Re: Highlight partial match

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/11/2016 12:15 AM, Juan Fernando Mora wrote:
> Hi, I have been doing some research on highlighting partial matches,
> there are some information on google but is far from complete and I
> just can't get it to work. *I have highlighting working but it
> highlights complete words, example:* 

I have no experience with highlighting, but I think this happens because
of how the EdgeNGram filter stores information in the Lucene index.  I put your
fieldType into a 6.2 example index and did index analysis on
"computer".  This is the result:

https://www.dropbox.com/s/ph524b8ij1hk28o/solr-analysis-computer-edgengrams.png?dl=0

Notice how every term has a start value of "0" and an end value of "8"
... this is the character position inside the original indexed text. 
Every term resolves to the original source text of "computer".
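
To make the start/end observation concrete, here is a small Python toy
model of it. This is not Lucene code: edge_ngrams is a hypothetical helper
that only imitates what the analysis screen displayed, on the assumption
that the filter gives every gram the offsets of the whole source token.

```python
def edge_ngrams(token, start, min_gram=1, max_gram=25):
    """Toy model of an edge n-gram *filter* (an assumption, not Lucene's code):
    every prefix of the token is emitted with the start/end offsets
    of the whole token rather than of the prefix itself."""
    end = start + len(token)
    longest = min(max_gram, len(token))
    return [(token[:n], start, end) for n in range(min_gram, longest + 1)]


# "computer" analyzed on its own, as in the analysis screen.
grams = edge_ngrams("computer", 0)
for term, start, end in grams:
    print(term, start, end)
```

Every printed row carries the same start and end, so nothing in this
model records where a prefix such as "comput" stops inside the word.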

I believe these start/end values in the index are how highlighting
decides *what* to highlight, though I admit I could have a flawed
understanding of how it works.  If my understanding is correct, then
obtaining what you want would involve an alternate NGram filter that
writes different start/end values.  I don't think that an alternative
like this to EdgeNGram exists.
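
If that understanding is right, the difference between the two behaviours
can be sketched with a toy highlighter in Python. Everything here is
illustrative: highlight, shared and per_gram are made-up names, and the
"per-gram offsets" variant is the hypothetical alternative filter, not
something known to exist in Solr.

```python
def highlight(text, query, postings):
    """Wrap the span stored for the matching term in <em> tags.
    `postings` maps an indexed term to the (start, end) offsets kept for it."""
    hit = postings.get(query.lower())
    if hit is None:
        return text
    start, end = hit
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


text = "Fancy Computer"
word, word_start = "computer", 6  # "Computer" begins at character 6

# Offsets as seen in the analysis screen: every gram spans the whole word.
shared = {word[:n]: (word_start, word_start + len(word))
          for n in range(1, len(word) + 1)}

# Hypothetical alternative: each gram records only its own length.
per_gram = {word[:n]: (word_start, word_start + n)
            for n in range(1, len(word) + 1)}

whole_word = highlight(text, "comput", shared)
partial = highlight(text, "comput", per_gram)
print(whole_word)  # Fancy <em>Computer</em>
print(partial)     # Fancy <em>Comput</em>er
```

The first result matches what Juan is seeing today, and the second is the
output he is asking for; the only thing that changes between them is the
offsets stored with each gram.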

Thanks,
Shawn