Posted to dev@tika.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/04/14 10:40:50 UTC

[jira] Issue Comment Edited: (TIKA-379) Html elements and attributes not available in XHTML representation

    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794 ] 

Julien Nioche edited comment on TIKA-379 at 4/14/10 4:40 AM:
-------------------------------------------------------------

This is indeed a more generic problem. It also affects HTML elements like *link*, which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata either, and they are vital for a crawler.

I agree with Ken that it would be better not only to store this information in the metadata but also to be able to retrieve it from the SAX events.

This looks like it is due to the filtering done in DefaultHTMLMapper, which can be overridden in the Context, so we could simply pass a less restrictive filter. The default mapper is based on [http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd], which allows *link* elements within the *head*, so we could add it to _mapSafeElement()_. However, as there are no restrictions on the hierarchy, such elements would then also be allowed within the *body*.
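
To make the idea concrete, here is a minimal sketch of a "less restrictive filter" that wraps a strict one. It deliberately does not use Tika's classes: ElementMapper, StrictMapper and LinkAwareMapper are hypothetical stand-ins, and the whitelist in StrictMapper is invented for illustration. Only the method name mapSafeElement is borrowed from the discussion above, and its exact signature in Tika is an assumption here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for an element filter; NOT Tika's actual interface. */
interface ElementMapper {
    /** Returns the output name for a safe element, or null if it is dropped. */
    String mapSafeElement(String name);
}

/** Strict default: only a small, made-up whitelist of elements survives. */
class StrictMapper implements ElementMapper {
    private static final Map<String, String> SAFE = new HashMap<>();
    static {
        for (String n : new String[] {"p", "h1", "ul", "li", "table", "tr", "td"}) {
            SAFE.put(n.toUpperCase(Locale.ROOT), n);
        }
    }

    @Override
    public String mapSafeElement(String name) {
        return SAFE.get(name.toUpperCase(Locale.ROOT));
    }
}

/** Less restrictive filter: lets link through, otherwise defers to a delegate. */
class LinkAwareMapper implements ElementMapper {
    private final ElementMapper delegate;

    LinkAwareMapper(ElementMapper delegate) {
        this.delegate = delegate;
    }

    @Override
    public String mapSafeElement(String name) {
        // The decision is made from the element name alone, so this cannot
        // tell a link inside head from one inside body.
        if ("link".equalsIgnoreCase(name)) {
            return "link";
        }
        return delegate.mapSafeElement(name);
    }
}
```

Because the mapper only sees the element name, the sketch also shows the hierarchy concern: once *link* is accepted, it is accepted wherever it occurs.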

Any thoughts?





      was (Author: jnioche):
    This is indeed a more generic problem. It also affects HTML elements like *link*, which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata either, and they are vital for a crawler.

Is there a specific reason why these things are not rendered in the XHTML? I agree with Ken that it would be better not only to store this information in the metadata but also to be able to retrieve it from the SAX events.

Any thoughts on this?




  
> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.
