Posted to dev@tika.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/02/16 10:34:27 UTC

[jira] Created: (TIKA-379) Attribute on html tag not represented in XHTML

Attribute on html tag not represented in XHTML 
-----------------------------------------------

                 Key: TIKA-379
                 URL: https://issues.apache.org/jira/browse/TIKA-379
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Julien Nioche


The following HTML document:

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following XHTML by Tika:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-379) Lang attribute on html tag skipped

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834228#action_12834228 ] 

Ken Krugler commented on TIKA-379:
----------------------------------

I think this is part of a bigger issue re attributes getting stripped. E.g. <a rel="nofollow"> is important for web crawlers.

Since the language attribute can be applied to a variety of tags, I don't think it's an option to just store it in the metadata.
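
To illustrate why per-element attributes matter, here is a minimal sketch of a consumer that reads lang values from SAX events. It assumes the attributes survive into the event stream, which is what this issue asks for. It runs the JDK's own SAX parser over well-formed sample markup rather than Tika, and the class name LangAttributeCollector is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records every element that carries a lang attribute. */
final class LangAttributeCollector extends DefaultHandler {
    private final List<String> found = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The language can differ per element, so it is read on every start tag.
        String lang = atts.getValue("lang");
        if (lang != null) {
            found.add(qName + "=" + lang);
        }
    }

    /** Parses well-formed markup with the JDK SAX parser and lists element=lang pairs. */
    static List<String> collect(String xml) {
        LangAttributeCollector handler = new LangAttributeCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        System.out.println(collect(
                "<html lang=\"fi\"><body><p>jotain suomeksi</p>"
                + "<p lang=\"en\">something in English</p></body></html>"));
    }
}
```

With the sample in main, the collector reports html=fi and p=en, two values that a single metadata entry for the whole document could not represent.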


> Lang attribute on html tag skipped 
> -----------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.



[jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856848#action_12856848 ] 

Julien Nioche commented on TIKA-379:
------------------------------------

Thanks for your comments.
I had seen the HtmlMapper but, as I pointed out:
{quote}
There is actually a special treatment for the elements in HEAD done in the class HtmlHandler so simply adding *link* to the HtmlMapper does not solve the problem.
{quote}
I will send a patch later today which modifies the HtmlMapper to make it generate LINK elements in the XHTML output. This is a reasonable thing to do as this element is allowed in the XHTML DTD.
I will look at the HtmlMapper later to see how we could get it to keep the href attributes.

 

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.


[jira] Updated: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-379:
-------------------------------

    Attachment: TIKA-379-3.patch

Modified patch which fixes test errors. Could anyone review it?
Thanks

Julien 

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>         Attachments: TIKA-379, TIKA-379-2.patch, TIKA-379-3.patch
>
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.



[jira] Updated: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-379:
-------------------------------

    Attachment: TIKA-379

Adds the Base, Meta and Link elements found in the Head section to the XHTML output.

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>         Attachments: TIKA-379
>
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.


[jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860170#action_12860170 ] 

Jukka Zitting commented on TIKA-379:
------------------------------------

Re: second patch - seems like a good approach.

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>         Attachments: TIKA-379, TIKA-379-2.patch
>
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.



[jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794 ] 

Julien Nioche commented on TIKA-379:
------------------------------------

This is indeed a more generic problem. It also affects HTML elements like *link* which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata either and are vital for a crawler.

Is there a specific reason why these things are not rendered in the XHTML? I agree with Ken that it would be better not only to store information in the metadata but also to be able to retrieve them from the SAX events. 

Any thoughts on this?






[jira] Assigned: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-379:
--------------------------------------

    Assignee: Chris A. Mattmann

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>            Priority: Critical
>         Attachments: TIKA-379, TIKA-379-2.patch, TIKA-379-3.patch
>
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.



[jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856815#action_12856815 ] 

Julien Nioche commented on TIKA-379:
------------------------------------

There is actually a special treatment for the elements in HEAD done in the class HtmlHandler so simply adding *link* to the HtmlMapper does not solve the problem.


[jira] Updated: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-379:
-------------------------------

              Summary: Html elements and attributes not available in XHTML representation   (was: Lang attribute on html tag skipped )
    Affects Version/s: 0.7
             Priority: Critical  (was: Major)


[jira] Commented: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856839#action_12856839 ] 

Jukka Zitting commented on TIKA-379:
------------------------------------

The reason for the default HTML mapping rules in Tika is to simplify and normalize the input documents so that client applications can easily process all sorts of input (HTML or not) without needing type- or source-specific heuristics. The basic idea has been that clients should directly use the underlying parser libraries when they need custom processing of specific content types.

That said, I see the value of being able to process even complex HTML input through the Tika API, and perhaps the above original intent is too strict for many use cases. The HtmlMapper interface we added for TIKA-347 should make it possible to relax the mapping rules, and in revision 933909 I added an IdentityHtmlMapper implementation of this interface to make it even easier to use:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

Note that IdentityHtmlMapper breaks the guarantee that the Tika output is valid XHTML. Also, the HtmlMapper interface currently covers only elements, so all attributes are still lost. IdentityHtmlMapper also overrides the custom <a/> tag handling in HtmlHandler, so even the href attributes are gone. It would be good if we could extend the HtmlMapper mechanism to avoid these problems.
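
One way such an extension could look is sketched below. This is only an illustration of the idea, not the Tika interface: AttributeAwareMapper, WhitelistMapper, the mapSafeAttribute method and the whitelist contents are all assumptions made for the example. Only the mapSafeElement name comes from this discussion.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical extension of the mapper idea; not the actual Tika interface. */
interface AttributeAwareMapper {
    /** Returns the normalised element name, or null if the element is dropped. */
    String mapSafeElement(String name);

    /** Returns the normalised attribute name, or null if the attribute is dropped. */
    String mapSafeAttribute(String elementName, String attributeName);
}

final class WhitelistMapper implements AttributeAwareMapper {
    // Assumed whitelist for illustration: element -> attributes that survive.
    private static final Map<String, Set<String>> SAFE = Map.of(
            "html", Set.of("lang"),
            "a", Set.of("href", "rel"),
            "link", Set.of("href", "rel"),
            "p", Set.of("lang"));

    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return SAFE.containsKey(lower) ? lower : null;
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        String element = mapSafeElement(elementName);
        if (element == null) {
            return null; // attributes of dropped elements are dropped too
        }
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        return SAFE.get(element).contains(attribute) ? attribute : null;
    }
}
```

Returning null for "drop" keeps the attribute rule symmetrical with the element rule, so a handler can apply both with the same check.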


[jira] Updated: (TIKA-379) Lang attribute on html tag skipped

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-379:
-------------------------------

    Description: 
The following HTML document :

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika :

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost. The lang is not stored in the metadata either.



  was:
The following HTML document :

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika :

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost.



        Summary: Lang attribute on html tag skipped   (was: Attribute on html tag not represented in XHTML )



[jira] Updated: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-379:
-------------------------------

    Attachment: TIKA-379-2.patch

Attached a second patch with a suggested solution for normalising/filtering incoming attribute names. The code compiles but the tests fail. The purpose is mostly to illustrate the idea before implementing it properly.
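
The patch itself is not reproduced in this thread, so the following is not its content. It is a minimal sketch of what normalising and filtering attribute names on a SAX start tag can look like, using only the JDK's org.xml.sax classes. The class name AttributeFilter and the allowed-name set passed in by the caller are assumptions for the example.

```java
import java.util.Locale;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper, not taken from the attached patch. */
final class AttributeFilter {
    /** Copies only allowed attributes, lower-casing their names on the way. */
    static AttributesImpl filter(Attributes in, Set<String> allowed) {
        AttributesImpl out = new AttributesImpl();
        for (int i = 0; i < in.getLength(); i++) {
            // Normalise first so that HREF and href are treated alike.
            String name = in.getQName(i).toLowerCase(Locale.ROOT);
            if (allowed.contains(name)) {
                out.addAttribute("", name, name, "CDATA", in.getValue(i));
            }
        }
        return out;
    }
}
```

A handler would call filter(...) in startElement and forward the returned attributes instead of the incoming ones.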
 


[jira] Issue Comment Edited: (TIKA-379) Html elements and attributes not available in XHTML representation

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856794#action_12856794 ] 

Julien Nioche edited comment on TIKA-379 at 4/14/10 4:40 AM:
-------------------------------------------------------------

This is indeed a more generic problem. It also affects HTML elements like *link* which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata either and are vital for a crawler.

I agree with Ken that it would be better not only to store information in the metadata but also to be able to retrieve them from the SAX events. 

Looks like this is due to the filtering done in DefaultHtmlMapper, which can be overridden in the Context, so we could simply pass a less restrictive filter. The default mapper is based on [http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd] which allows *link* elements within the *head*, so we could add it to _mapSafeElement()_. However, as there are no restrictions on the hierarchy, this would mean that such elements would also be allowed within the *body*.

Any thoughts?
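
The hierarchy concern can be made concrete with a small sketch. It assumes a filter that tracks the open ancestors and keeps *link*, *meta* and *base* only while a *head* element is open. HeadAwareElementFilter is a hypothetical class written for this illustration and is not part of Tika.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

/** Hypothetical context-aware check: some elements are only kept inside head. */
final class HeadAwareElementFilter {
    private static final Set<String> HEAD_ONLY = Set.of("link", "meta", "base");

    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    /** Call on every start tag; returns true if the element should be emitted. */
    boolean startElement(String name) {
        boolean keep = !HEAD_ONLY.contains(name) || open.contains("head");
        open.push(name);
        return keep;
    }

    /** Call on every end tag to keep the ancestor stack in sync. */
    void endElement() {
        open.pop();
    }
}
```

A name-only rule such as mapSafeElement() cannot make this distinction, because it never sees the ancestors of the element it is asked about.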





      was (Author: jnioche):
    This is indeed a more generic problem. It also affects HTML elements like *link* which are commonly used in head sections to specify favicons or canonical representations. These values are not stored in the metadata  either and are vital for a crawler.

Is there a specific reason why these things are not rendered in the XHTML? I agree with Ken that it would be better not only to store information in the metadata but also to be able to retrieve them from the SAX events. 

Any thoughts on this?




  