Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2013/04/11 05:55:16 UTC

[jira] [Comment Edited] (SOLR-4686) HTMLStripCharFilter and Highlighter generates invalid HTML

    [ https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628645#comment-13628645 ] 

Steve Rowe edited comment on SOLR-4686 at 4/11/13 3:54 AM:
-----------------------------------------------------------

bq. Do you think the highlighter/formatter has a problem, or are the offsets of the HTMLStripCharFilter the problem?

The existing HTML formatters try to insert start and end tags without being aware of the structure into which they're inserting, and this is a problem when the existing intervening markup is not balanced.

As I mentioned in my previous comment, I think HTMLStripCharFilter could treat end tags differently and improve the output for your example. But I can think of cases where the current behavior works and changing it would make things worse. For example, when highlighting the phrase "xxx yyy" in the original markup 'xxx <b>yyy</b>', the current behavior produces balanced output: '<em>xxx <b>yyy</b></em>'. If end tag offsets were changed in the way I suggested, the output would be imbalanced: '<em>xxx <b>yyy</em></b>'. So on balance, I'm disinclined to make any changes.
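
A minimal sketch of this trade-off, assuming the token end offset either includes the trailing end tag (the behavior reported in this issue) or stops before it (the change discussed above). This is a plain-Python simulation, not Lucene code: the names wrap and is_balanced are hypothetical, and the offsets are worked out by hand from the two examples.

```python
import re

# Matches simple start and end tags; void and self-closing tags are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def wrap(text, start, end, pre="<em>", post="</em>"):
    """Insert pre/post at raw character offsets, with no knowledge of markup."""
    return text[:start] + pre + text[start:end] + post + text[end:]

def is_balanced(markup):
    """Return True if every end tag closes the most recently opened tag."""
    stack = []
    for closing, name in TAG.findall(markup):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

phrase_markup = "xxx <b>yyy</b>"   # phrase "xxx yyy" spans an inline element
single_markup = "<a>xxx</a>"       # the reporter's example

# End offset includes the end tag (assumed current behavior).
phrase_current = wrap(phrase_markup, 0, 14)   # <em>xxx <b>yyy</b></em>
single_current = wrap(single_markup, 3, 10)   # <a><em>xxx</a></em>

# End offset stops before the end tag (assumed changed behavior).
phrase_changed = wrap(phrase_markup, 0, 10)   # <em>xxx <b>yyy</em></b>
single_changed = wrap(single_markup, 3, 6)    # <a><em>xxx</em></a>

for label, markup in [
    ("phrase, current", phrase_current),
    ("single, current", single_current),
    ("phrase, changed", phrase_changed),
    ("single, changed", single_changed),
]:
    print(label, markup, "balanced" if is_balanced(markup) else "imbalanced")
```

Under these assumptions each choice of end offset fixes one example and breaks the other, which is why a formatter that is unaware of the surrounding structure cannot be made correct by adjusting offsets alone.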

bq. In my case, I use HTMLStripCharFilter to normalize XML input; therefore I would be happy about a switch "do not treat inline elements".

Have you seen the XmlCharFilter on SOLR-2597?

                
                  
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
>                 Key: SOLR-4686
>                 URL: https://issues.apache.org/jira/browse/SOLR-4686
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 4.1
>            Reporter: Holger Floerke
>              Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may lead to an invalid HTML highlight.
> The HTMLStripCharFilter treats inline elements (e.g. "a", "b", ...) specially. For these elements the CharFilter ignores the tag and does not insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 ending on position 10(!) 
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.
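
The reported result can be reproduced with a small simulation, assuming the highlighter simply inserts its tags at the token's start and end offsets in the original text. This is an illustrative sketch only; insert_highlight is a hypothetical helper, not part of Solr or Lucene.

```python
def insert_highlight(text, start, end, pre="<em>", post="</em>"):
    """Wrap text[start:end] in pre/post, ignoring any markup inside the span."""
    return text[:start] + pre + text[start:end] + post + text[end:]

original = "<a>xxx</a>"

# Offsets as reported in this issue: the token starts after "<a>" (3)
# and ends after "</a>" (10), so the end tag falls inside the highlight.
highlighted = insert_highlight(original, 3, 10)
print(highlighted)  # <a><em>xxx</a></em>
```

The start offset skips the start tag while the end offset includes the end tag, so the inserted tags cross the "a" element.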

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
