Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/05/29 18:29:00 UTC

[jira] [Resolved] (TIKA-2100) Html Parser does not keep the html tag attributes

     [ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2100.
-------------------------------
    Resolution: Fixed

Thank you [~Gerard Bouchar]!

> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Parsing a very simple HTML document like
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html> 
> you won't be able to access the <html> tag's attributes (here lang="en") in the ContentHandler:
> * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty.
> * Moreover, it seems that the <html> tag's attributes are not passed through the HtmlMapper.mapSafeAttribute method either.
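
For anyone reproducing the report above, here is a minimal sketch of a handler that records the attributes passed to startElement. The class name AttributeProbe and its probe helper are hypothetical, not part of Tika. To stay runnable without Tika on the classpath, the sketch drives the handler with the JDK's own SAX parser as a stand-in, so it shows what a handler would expect to receive (lang="en" on the html element), not what Tika 1.13 actually delivers. The DOCTYPE line is left out of the sample input for simplicity.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records, per element name, the attributes seen in startElement.
class AttributeProbe extends DefaultHandler {
    // element name -> (attribute name -> value), in document order
    public final Map<String, Map<String, String>> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        // Copy the attributes, since the Attributes object is only valid during this callback.
        Map<String, String> copy = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            copy.put(atts.getQName(i), atts.getValue(i));
        }
        seen.put(name, copy);
    }

    // Stand-in driver (assumption): uses the JDK SAX parser instead of Tika's HtmlParser.
    public static AttributeProbe probe(String xhtml) throws Exception {
        AttributeProbe handler = new AttributeProbe();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = String.join("\n",
                "<html lang=\"en\">",
                "<head>",
                "<title>Page Title</title>",
                "</head>",
                "<body>",
                "<h1 align=\"left\">My First Heading</h1>",
                "<p>My first paragraph.</p>",
                "</body>",
                "</html>");
        AttributeProbe p = probe(xhtml);
        // Print each element with the attributes the handler received for it.
        p.seen.forEach((element, attrs) -> System.out.println(element + " " + attrs));
    }
}
```

To check the behavior described in the issue, the same handler could be passed as the ContentHandler argument to Tika's HtmlParser.parse(stream, handler, metadata, context) and the entry for "html" inspected. According to the report, it comes back empty on 1.13.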



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)