Posted to dev@tika.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2012/11/06 11:54:12 UTC

[jira] [Created] (TIKA-1017) DefaultHtmlMapper misses some safe elements

Daniel Bonniot de Ruisselet created TIKA-1017:
-------------------------------------------------

             Summary: DefaultHtmlMapper misses some safe elements
                 Key: TIKA-1017
                 URL: https://issues.apache.org/jira/browse/TIKA-1017
             Project: Tika
          Issue Type: Bug
            Reporter: Daniel Bonniot de Ruisselet


The code of DefaultHtmlMapper says that the list of "safe" elements is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Elements like <sub> and <i> are not included in the safe list. Is this intentional (a comment with the rationale would be useful) or should they be added?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492342#comment-13492342 ] 

Jukka Zitting commented on TIKA-1017:
-------------------------------------

The idea behind DefaultHtmlMapper is to normalize and simplify the incoming HTML as much as possible while still preserving the semantic structure of the document. We can add extra elements if there's a good use case that's not already covered by the IdentityHtmlMapper class.
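
To make the trade-off above concrete, here is a minimal, self-contained sketch of a mapper that keeps a restrictive base list but additionally lets <sub> and <i> through. It is not Tika code: ElementMapper is a local stand-in (Tika's HtmlMapper interface has a mapSafeElement method of similar shape), the BASE list is a deliberately short hypothetical list rather than the real DefaultHtmlMapper list, and the case-insensitive lookup is an assumption made for the example.

```java
import java.util.Locale;
import java.util.Map;

public class ExtendedMapperSketch {

    /** Local stand-in: returns the normalized element name, or null if the tag should be dropped. */
    public interface ElementMapper {
        String mapSafeElement(String name);
    }

    /** Hypothetical base list, kept short on purpose. NOT the real DefaultHtmlMapper list. */
    public static final ElementMapper BASE = new ElementMapper() {
        private final Map<String, String> safe =
                Map.of("P", "p", "H1", "h1", "B", "b", "TABLE", "table");

        @Override
        public String mapSafeElement(String name) {
            // Assumption: names are matched case-insensitively.
            return safe.get(name.toUpperCase(Locale.ROOT));
        }
    };

    /** Wraps a base mapper and also accepts the elements raised in this issue. */
    public static ElementMapper withExtras(ElementMapper base) {
        Map<String, String> extras = Map.of("SUB", "sub", "I", "i");
        return name -> {
            String mapped = base.mapSafeElement(name);
            // Fall back to the extra elements only when the base mapper rejects the name.
            return mapped != null ? mapped : extras.get(name.toUpperCase(Locale.ROOT));
        };
    }

    public static void main(String[] args) {
        ElementMapper mapper = withExtras(BASE);
        for (String name : new String[] {"p", "sub", "i", "script"}) {
            System.out.println(name + " -> " + mapper.mapSafeElement(name));
        }
    }
}
```

Delegating to the base mapper first means the extra elements only widen the list; anything the base mapper already normalizes keeps its existing mapping.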
                

[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491492#comment-13491492 ] 

Ken Krugler commented on TIKA-1017:
-----------------------------------

Hi Daniel - this sounds like a question for the mailing list. If, after discussion, it turns out to be a bug, you would then file a Jira issue. The mailing list is also the best way to get input from the author (I think that was Jukka).
                
