Posted to dev@tika.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2012/11/06 11:54:12 UTC
[jira] [Created] (TIKA-1017) DefaultHtmlMapper misses some safe elements
Daniel Bonniot de Ruisselet created TIKA-1017:
-------------------------------------------------
Summary: DefaultHtmlMapper misses some safe elements
Key: TIKA-1017
URL: https://issues.apache.org/jira/browse/TIKA-1017
Project: Tika
Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet
The code of DefaultHtmlMapper says that the list of "safe" elements is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
Elements like <sub> and <i> are not included in the safe list. Is this intentional (a comment with the rationale would be useful) or should they be added?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492342#comment-13492342 ]
Jukka Zitting commented on TIKA-1017:
-------------------------------------
The idea behind DefaultHtmlMapper is to normalize and simplify the incoming HTML as much as possible while still preserving the semantic structure of the document. We can add extra elements if there's a good use case that's not already covered by the IdentityHtmlMapper class.
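To make the trade-off concrete, here is a minimal, self-contained sketch of the "safe element" idea: a base mapper that keeps only a small whitelist, and a wrapper that additionally lets <sub> and <i> through. It does not depend on Tika. The names ElementMapper, BaseMapper and ExtendedMapper are hypothetical stand-ins, and the whitelist shown is an illustrative subset, not the actual list in DefaultHtmlMapper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SafeElementSketch {

    /** Hypothetical stand-in for a mapper that decides which elements survive. */
    interface ElementMapper {
        /** Returns the normalized element name, or null if the element is not "safe". */
        String mapSafeElement(String name);
    }

    /** Base mapper with a small, illustrative whitelist (NOT Tika's real list). */
    static class BaseMapper implements ElementMapper {
        private final Map<String, String> safe = new HashMap<>();

        BaseMapper() {
            for (String element : Arrays.asList("p", "h1", "ul", "li", "b", "a")) {
                safe.put(element, element);
            }
        }

        @Override
        public String mapSafeElement(String name) {
            // Lookup is case-insensitive; unknown elements map to null.
            return safe.get(name.toLowerCase(Locale.ROOT));
        }
    }

    /** Wrapper that accepts a few extra elements and delegates everything else. */
    static class ExtendedMapper implements ElementMapper {
        private final ElementMapper delegate;
        private final Set<String> extra;

        ExtendedMapper(ElementMapper delegate, String... extraElements) {
            this.delegate = delegate;
            this.extra = new HashSet<>(Arrays.asList(extraElements));
        }

        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            if (extra.contains(lower)) {
                return lower;
            }
            return delegate.mapSafeElement(name);
        }
    }

    public static void main(String[] args) {
        ElementMapper base = new BaseMapper();
        ElementMapper extended = new ExtendedMapper(base, "sub", "i");

        System.out.println(base.mapSafeElement("SUB"));     // prints "null"
        System.out.println(extended.mapSafeElement("SUB")); // prints "sub"
        System.out.println(extended.mapSafeElement("P"));   // prints "p"
    }
}
```

Wrapping rather than editing the base whitelist keeps the default normalization untouched for callers who rely on it, while a caller who needs elements such as <sub> or <i> can opt in explicitly.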
[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements
Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491492#comment-13491492 ]
Ken Krugler commented on TIKA-1017:
-----------------------------------
Hi Daniel - this sounds like a question for the mailing list. If, after discussion, it turns out to be a bug, you would then file a Jira issue. The mailing list is also the best way to get input from the author (I think that was Jukka).