Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/05/29 18:29:00 UTC
[jira] [Resolved] (TIKA-2100) Html Parser does not keep the html tag attributes
[ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2100.
-------------------------------
Resolution: Fixed
Thank you [~Gerard Bouchar]!
> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
> Key: TIKA-2100
> URL: https://issues.apache.org/jira/browse/TIKA-2100
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Gerard Bouchar
> Priority: Major
>
> When parsing a very simple HTML document like
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the ContentHandler:
> * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the HtmlMapper.mapSafeAttribute method either.
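
For reference, here is a minimal sketch of the kind of handler the report describes. It is not Tika code: AttributeRecorder is a hypothetical name, and the JDK's built-in SAX parser stands in for Tika's HtmlParser as an assumption (this only works because the sample page happens to be well-formed XML). It shows the behaviour the reporter expected, where atts for the html element carries lang="en".

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records the attributes
// passed to startElement for every element in the document.
class AttributeRecorder extends DefaultHandler {

    // The sample page from the report.
    static final String SAMPLE =
        "<!DOCTYPE html>\n"
        + "<html lang=\"en\">\n"
        + "<head>\n"
        + "<title>Page Title</title>\n"
        + "</head>\n"
        + "<body>\n"
        + "<h1 align=\"left\">My First Heading</h1>\n"
        + "<p>My first paragraph.</p>\n"
        + "</body>\n"
        + "</html>\n";

    // Element name -> (attribute name -> attribute value), in document order.
    final Map<String, Map<String, String>> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        // Copy the attributes, since a parser may reuse the Attributes object.
        Map<String, String> copy = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            copy.put(atts.getQName(i), atts.getValue(i));
        }
        seen.put(name, copy);
    }

    // Assumption: the JDK SAX parser is used here in place of Tika's HtmlParser.
    static Map<String, Map<String, String>> collect(String markup) throws Exception {
        AttributeRecorder recorder = new AttributeRecorder();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(markup)), recorder);
        return recorder.seen;
    }

    public static void main(String[] args) throws Exception {
        // Print what the handler saw for each element.
        System.out.println(collect(SAMPLE));
    }
}
```

With Tika, the same handler would instead be passed to HtmlParser.parse(stream, handler, metadata, context). According to the report, in 1.13 the entry recorded for html then comes back empty.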