Posted to solr-user@lucene.apache.org by "Neumann, Dennis" <ne...@sub.uni-goettingen.de> on 2016/09/05 16:00:09 UTC

Wrong highlighting in stripped HTML field

Hi guys

I am having a problem with the standard highlighter. I'm working with Solr 5.4.1. The problem appears in my project, but it is easy to replicate:

I create a new core with the conf directory from configsets/basic_configs, so everything is set to defaults. I add the following in schema.xml:


    <field name="testfield" type="mytype" indexed="true" stored="true" required="false" multiValued="false" />

    <fieldType name="mytype" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
    </fieldType>


Now I add this document (in the admin interface):

{"id":"1","testfield":"<span>bla</span>"}

I search for: testfield:bla
with hl=on&hl.fl=testfield
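
A minimal sketch for reproducing the request from a script. It only builds the query URL. The host, port and core name (`localhost:8983`, `testcore`) are assumptions for illustration, as is `wt=json`; only `q`, `hl` and `hl.fl` come from the request above.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, field, term):
    """Build a /select URL that searches `field` for `term` with highlighting on."""
    params = {
        "q": f"{field}:{term}",  # e.g. testfield:bla
        "hl": "on",              # enable highlighting
        "hl.fl": field,          # highlight the same field that is searched
        "wt": "json",            # assumption: ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core location; adjust to your installation.
url = build_highlight_url("http://localhost:8983/solr/testcore", "testfield", "bla")
print(url)
```

The colon in `testfield:bla` is percent-encoded by `urlencode`, so the URL can be pasted into a browser or passed to any HTTP client as is.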

What I get is a response with an incorrectly formatted HTML snippet:


  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "testfield": "<span>bla</span>",
        "_version_": 1544645963570741200
      }
    ]
  },
  "highlighting": {
    "1": {
      "testfield": [
        "<span><em>bla</span></em>"
      ]
    }
  }

Is there a way to tell the highlighter to enclose just the "bla"? That is, I want to get:

<span><em>bla</em></span>


Best regards
Dennis


Re: Wrong highlighting in stripped HTML field

Posted by "Neumann, Dennis" <ne...@sub.uni-goettingen.de>.
Hello,
Thank you very much for your answers. As described in SOLR-4686, the problem only occurs with inline HTML tags (like <a> or <span>). So in my case the solution is to use a block element and force it to be displayed inline:

<div style="display: inline">bla</div>

highlighting:

<div style="display: inline"><em>bla</em></div>
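
A minimal sketch of how existing content could be converted before indexing, assuming the markup is simple enough for a regular expression. The function name is hypothetical, and the sketch deliberately drops any attributes on the original `<span>`, so it is an illustration of the workaround rather than a general HTML rewriter.

```python
import re

INLINE_DIV_OPEN = '<div style="display: inline">'


def spans_to_inline_divs(html):
    """Replace <span> elements with <div> elements that are styled inline.

    Assumption: attributes on the original <span> are not needed and are dropped.
    """
    # Opening tags, with or without attributes.
    html = re.sub(r"<span\b[^>]*>", INLINE_DIV_OPEN, html, flags=re.IGNORECASE)
    # Closing tags.
    return re.sub(r"</span\s*>", "</div>", html, flags=re.IGNORECASE)


print(spans_to_inline_divs("<span>bla</span>"))
```

For real documents an HTML parser would be the safer choice, since regular expressions do not cope with comments, CDATA sections or unusual attribute quoting.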

Cheers and thanks again,
Dennis


________________________________________
From: Alan Woodward [alan@flax.co.uk]
Sent: Thursday, 8 September 2016 12:48
To: solr-user@lucene.apache.org
Subject: Re: Wrong highlighting in stripped HTML field

Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing point of contention!

Alan Woodward
www.flax.co.uk


> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH <ge...@syngenta.com> wrote:
>
> As far as I can tell, that is how it's currently set up (it does the same on mine, at least). The HTML stripper seems to exclude the opening tag but include the closing tag when it generates the start and end offsets of each text token. I couldn't say why, though. (This may just have avoided the need to backtrack.)
>
> Play around in the Analysis section of the admin UI to verify this.
>
> Geraint
>
>
> -----Original Message-----
> From: Neumann, Dennis [mailto:neumann@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: Re: Wrong highlighting in stripped HTML field
>
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should ask on the dev mailing list.
>
> Thanks and cheers,
> Dennis


Re: Wrong highlighting in stripped HTML field

Posted by Alan Woodward <al...@flax.co.uk>.
Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing point of contention!

Alan Woodward
www.flax.co.uk




RE: Wrong highlighting in stripped HTML field

Posted by "Duck Geraint (ext) GBJH" <ge...@syngenta.com>.
As far as I can tell, that is how it's currently set up (it does the same on mine, at least). The HTML stripper seems to exclude the opening tag but include the closing tag when it generates the start and end offsets of each text token. I couldn't say why, though. (This may just have avoided the need to backtrack.)

Play around in the Analysis section of the admin UI to verify this.
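
A small sketch of why such offsets produce the broken snippet. It does not call Solr; the offsets are assumed values chosen to match the behaviour described above (start after the opening tag, end after the closing tag), and `wrap_by_offsets` is a hypothetical stand-in for what an offset-based highlighter does with the stored value.

```python
def wrap_by_offsets(stored, start, end, pre="<em>", post="</em>"):
    """Insert highlight tags around stored[start:end]."""
    return stored[:start] + pre + stored[start:end] + post + stored[end:]


stored = "<span>bla</span>"

# Assumed offsets matching the observed behaviour:
# the start excludes "<span>", the end includes "</span>".
observed_start = len("<span>")
observed_end = len(stored)
observed = wrap_by_offsets(stored, observed_start, observed_end)
print(observed)  # <span><em>bla</span></em>

# Offsets that would give the desired result: the end sits right after "bla".
wanted_end = observed_start + len("bla")
wanted = wrap_by_offsets(stored, observed_start, wanted_end)
print(wanted)    # <span><em>bla</em></span>
```

The highlighter only sees character offsets into the stored value, so whichever end offset the analysis chain reports decides on which side of the closing tag the `</em>` lands.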

Geraint




________________________________


Syngenta Limited, Registered in England No 2710846; Registered Office : Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire, RG42 6EY, United Kingdom
________________________________
 This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited.

Re: Wrong highlighting in stripped HTML field

Posted by "Neumann, Dennis" <ne...@sub.uni-goettingen.de>.
Hello,
can anyone confirm this behavior of the highlighter? Otherwise my Solr installation might be misconfigured or something.
Or does anyone know if this is a known issue? In that case I probably should ask on the dev mailing list.

Thanks and cheers,
Dennis

