You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Cassandra Targett (JIRA)" <ji...@apache.org> on 2018/09/21 16:28:01 UTC

[jira] [Assigned] (SOLR-1394) HTML stripper is splitting tokens

     [ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cassandra Targett reassigned SOLR-1394:
---------------------------------------

    Assignee: Cassandra Targett

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>            Assignee: Cassandra Targett
>            Priority: Major
>             Fix For: 1.4
>
>         Attachments: SOLR-1394.patch, SOLR-1394.patch, hex-entity.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially interantional).
> The current code is actually deficient for BOTH highlighting and indexing, where the previous incarnation (that did not insert spaces) only had problems with highlighting.
> The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most situations. If the client is required to transform entities before handing it to Solr, it might as well be required to also strip tags, and then the HTML stripper would not be needed at all.
> Today, we have a better solution that can be used: offset correction. We can then avoid inserting extra whitespace, but still get correct offsets. The attached patch implements just that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org