Posted to dev@tika.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/07/07 17:28:50 UTC

[jira] Created: (TIKA-457) HTMLParser gets an early </body> event

HTMLParser gets an early </body> event
--------------------------------------

                 Key: TIKA-457
                 URL: https://issues.apache.org/jira/browse/TIKA-457
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Julien Nioche


I am using the IdentityMapper in the HTMLparser with this simple document:

{code}
<html><head><title> my title </title>
</head>
<body>
<frameset rows=\"20,*\"> 
<frame src=\"top.html\">
</frame>
<frameset cols=\"20,*\">
<frame src=\"left.html\">
</frame>
<frame src=\"invalid.html\"/>
</frame>
<frame src=\"right.html\">
</frame>
</frameset>
</frameset>
</body></html>
{code}

Strangely, the HTMLHandler receives an endElement call for the body *BEFORE* we reach the frameset. As a result, the bodylevel variable is decremented back to 0, and the remaining elements are ignored because of the logic implemented in HTMLHandler.

Any idea?




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-457) HTMLParser gets an early </body> event

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899020#action_12899020 ] 

Ken Krugler commented on TIKA-457:
----------------------------------

Just applied a patch (SVN 986089) for a problem that showed up during testing on a larger dataset. An empty value in Metadata was being emitted as a <meta> tag with an empty content=xxx attribute, which can cause SAX processing code to throw an NPE.
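
For illustration only, one way to guard against this kind of problem is to skip metadata entries whose value is null or empty before emitting <meta> elements. The sketch below is an assumption, not the change committed in SVN 986089: the class MetaEmitterDemo and its methods are hypothetical names, and a plain Map stands in for Tika's Metadata class. Only JDK SAX classes are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a plain Map stands in for Tika's Metadata.
class MetaEmitterDemo {

    // Emits one <meta name=... content=...> element per usable entry.
    // Entries with a null or empty value are skipped, so downstream SAX
    // code never sees a <meta> element without real content.
    static void emitMeta(Map<String, String> metadata, ContentHandler handler)
            throws SAXException {
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue; // the guard: nothing useful to emit
            }
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "name", "name", "CDATA", entry.getKey());
            atts.addAttribute("", "content", "content", "CDATA", value);
            handler.startElement("", "meta", "meta", atts);
            handler.endElement("", "meta", "meta");
        }
    }

    // Helper for the demo: returns the names of the <meta> elements emitted.
    static List<String> emittedNames(Map<String, String> metadata)
            throws SAXException {
        List<String> names = new ArrayList<>();
        emitMeta(metadata, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(atts.getValue("name"));
            }
        });
        return names;
    }

    public static void main(String[] args) throws SAXException {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "my title");
        metadata.put("description", ""); // empty value: skipped
        metadata.put("author", null);    // null value: skipped
        System.out.println(emittedNames(metadata));
    }
}
```

Filtering at the point of emission keeps the check in one place, rather than requiring every downstream handler to defend against missing attribute values.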




[jira] Commented: (TIKA-457) HTMLParser gets an early </body> event

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886061#action_12886061 ] 

Ken Krugler commented on TIKA-457:
----------------------------------

It's TagSoup that's generating the "interesting" output. Straight from a TagSoup parser (without Tika), the above gives you:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<html><head><title> my title </title></head><body/><frameset rows="20,*"><frame frameborder="1" scrolling="auto" src="top.html"/><frameset cols="20,*"><frame frameborder="1" scrolling="auto" src="left.html"/><frame frameborder="1" scrolling="auto" src="invalid.html"/><frame frameborder="1" scrolling="auto" src="right.html"/></frameset></frameset></html>
{code}

According to the XHTML 1.0 "frameset" DTD and the HTML 4.01 "frameset" DTD, the <frameset> element should NOT be inside of a body tag, which is why you're seeing the odd output.

I believe the issue here is that based on TagSoup's state machine architecture, the <body> tag has been emitted by the time you get to the <frameset>. TagSoup could hang onto the <body> tag until it sees something other than a <frameset>, but that feels pretty extreme.
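
To make the effect concrete, here is a minimal sketch of how a body-level counter reacts to the event order TagSoup produces. It is a simplified stand-in, not Tika's actual HTMLHandler: the class BodyLevelDemo, the BodyFilter handler, and the kept/dropped lists are hypothetical names, and the input is a shortened version of the TagSoup output above, parsed with the JDK's own SAX parser rather than TagSoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a simplified stand-in for a body-level filter.
class BodyLevelDemo {

    // Tracks nesting inside <body> and sorts every other element into
    // "kept" (started inside a body) or "dropped" (started outside one).
    static class BodyFilter extends DefaultHandler {
        int bodyLevel = 0;
        final List<String> kept = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyLevel++;
            } else if (bodyLevel > 0) {
                kept.add(qName);
            } else {
                dropped.add(qName);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyLevel--; // <body/> closes immediately, so this hits 0 early
            }
        }
    }

    // Parses well-formed XML with the JDK SAX parser and returns the filter.
    static BodyFilter parse(String xml) throws Exception {
        BodyFilter handler = new BodyFilter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Shortened form of the TagSoup output: <body/> precedes <frameset>.
        String tagSoupOutput =
                "<html><head><title> my title </title></head><body/>"
                + "<frameset rows=\"20,*\"><frame src=\"top.html\"/>"
                + "</frameset></html>";
        BodyFilter result = parse(tagSoupOutput);
        System.out.println("kept: " + result.kept);
        System.out.println("dropped: " + result.dropped);
    }
}
```

Because the empty <body/> element opens and closes before the <frameset> starts, the counter is already back at 0 when the frameset and frame elements arrive, so both end up in the dropped list.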

Side note - the HTML is slightly broken, in that <frame src=\"invalid.html\"/> is followed by </frame>, but it's already been terminated by the "/>" sequence. Don't know if that was intentional or not.

Also, strictly speaking, <frame> is declared as an empty element, so it cannot have a closing </frame> tag the way the example does. The frames should be written as <frame src="blah"> without a </frame>.





[jira] Assigned: (TIKA-457) HTMLParser gets an early </body> event

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-457:
--------------------------------

    Assignee: Ken Krugler



[jira] Updated: (TIKA-457) HTMLParser gets an early </body> event

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-457:
-----------------------------

    Attachment: TIKA-457.patch

This also improves handling of <frame> elements for [TIKA-463], by resolving relative URLs in src=xxx attributes for these elements.



[jira] Resolved: (TIKA-457) HTMLParser gets an early </body> event

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-457.
------------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

SVN 985288
