Posted to dev@tika.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2015/03/04 11:37:05 UTC

[jira] [Comment Edited] (TIKA-1017) DefaultHtmlMapper misses some safe elements

    [ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346702#comment-14346702 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1017 at 3/4/15 10:36 AM:
----------------------------------------------------------------------------

If we want to preserve the semantics, maybe at least SUB and SUP should be added? For instance, in a scientific document, "a<sup>b</sup>" and "a<sub>b</sub>" can denote different concepts, and that distinction is lost if you only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and STRONG.

It's easy enough to use a custom mapper, so this is not a huge issue, but a good default is always nice. Given the above, maybe only add SUB and SUP?
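
To illustrate the custom-mapper workaround mentioned above, here is a minimal, self-contained sketch. It does not use Tika's classes: ElementMapper and BaselineMapper are hypothetical stand-ins, loosely modeled on the idea of a mapper that returns a normalized name for "safe" elements and null for everything else, and the baseline's element list is invented for the example. In Tika itself the same wrapping idea would presumably be applied to an implementation of org.apache.tika.parser.html.HtmlMapper supplied through the ParseContext, but treat that as an assumption to check against the Tika version in use.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a Tika-style HTML mapper, reduced to one method.
interface ElementMapper {
    // Returns the normalized name for a safe element, or null if it is not safe.
    String mapSafeElement(String name);
}

// Assumed baseline that does not treat SUB/SUP as safe (illustrative list only).
final class BaselineMapper implements ElementMapper {
    private static final Set<String> SAFE = Set.of("P", "H1", "UL", "LI", "A");

    @Override
    public String mapSafeElement(String name) {
        String upper = name.toUpperCase(Locale.ENGLISH);
        return SAFE.contains(upper) ? upper.toLowerCase(Locale.ENGLISH) : null;
    }
}

// Wraps another mapper and additionally accepts SUB and SUP.
final class SubSupMapper implements ElementMapper {
    private static final Set<String> EXTRA = Set.of("SUB", "SUP");
    private final ElementMapper delegate;

    SubSupMapper(ElementMapper delegate) {
        this.delegate = delegate;
    }

    @Override
    public String mapSafeElement(String name) {
        String upper = name.toUpperCase(Locale.ENGLISH);
        if (EXTRA.contains(upper)) {
            return upper.toLowerCase(Locale.ENGLISH);
        }
        // Everything else keeps the wrapped mapper's behavior.
        return delegate.mapSafeElement(name);
    }
}
```

Delegating rather than copying the safe list keeps the wrapped mapper's behavior for every other element, so only the SUB/SUP decision differs from the default.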



was (Author: dbr):
If we want to preserve the semantics, maybe at least SUB and SUP should be added? For instance in a scientific document, "a<sup>b</sup>" and "a<sub>b</sub>" might be different concepts, which are lost if you only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and STRONG.

It's easy enough to use another mapper, so this is not a huge issue, but a good default is always nice. Given the above, maybe only add SUB and SUP?


> DefaultHtmlMapper misses some safe elements
> -------------------------------------------
>
>                 Key: TIKA-1017
>                 URL: https://issues.apache.org/jira/browse/TIKA-1017
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>
> The code of DefaultHtmlMapper says that the list of "safe" elements is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like <sub> and <i> are not included in the safe list. Is this intentional (a comment with the rationale would be useful) or should they be added?


