HTMLParser gets an early </body> event
--------------------------------------

                 Key: TIKA-457
                 URL: https://issues.apache.org/jira/browse/TIKA-457
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Julien Nioche

I am using the IdentityMapper in the HTMLparser with this simple document:
{code}
<html><head><title> my title </title>
</head>
<body>
<frameset rows=\"20,*\">
<frame src=\"top.html\">
</frame>
<frameset cols=\"20,*\">
<frame src=\"left.html\">
</frame>
<frame src=\"invalid.html\"/>
</frame>
<frame src=\"right.html\">
</frame>
</frameset>
</frameset>
</body></html>
{code}
Strangely the HTMLHandler is getting a call to endElement on the body *BEFORE* we reach frameset. As a result the variable bodylevel is decremented back to 0 and the remaining entities are ignored due to the logic implemented in HTMLHandler.
Any idea?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899020#action_12899020 ]

Ken Krugler commented on TIKA-457:
----------------------------------

Just applied a patch (SVN 986089) for a problem that showed up during testing on a larger dataset. An empty value in Metadata was being emitted as a <meta> tag with an empty content=xxx attribute, which can cause SAX processing code to throw an NPE.
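The patch itself is not shown in this thread, so the following is only a rough sketch of the kind of guard described above: skip metadata entries whose value is null or empty before turning them into <meta> tags. MetaTagSketch and metaTags are hypothetical names, and a plain Map stands in for Tika's Metadata class; the real change in SVN 986089 may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetaTagSketch {
    // Builds one <meta> tag per entry, skipping null or empty values.
    // Attribute escaping is deliberately left out to keep the sketch short.
    static List<String> metaTags(Map<String, String> metadata) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue; // an empty content attribute is what tripped up downstream SAX code
            }
            tags.add("<meta name=\"" + entry.getKey() + "\" content=\"" + value + "\"/>");
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "my title");
        metadata.put("description", ""); // hypothetical empty entry
        metaTags(metadata).forEach(System.out::println);
    }
}
```

Only the title entry yields a tag here; the empty description is dropped before it can reach any SAX consumer.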

    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886061#action_12886061 ]

Ken Krugler commented on TIKA-457:
----------------------------------

It's TagSoup that's generating the "interesting" output. Straight from a TagSoup parser (without Tika), the above gives you:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<html><head><title> my title </title></head><body/><frameset rows="20,*"><frame frameborder="1" scrolling="auto" src="top.html"/><frameset cols="20,*"><frame frameborder="1" scrolling="auto" src="left.html"/><frame frameborder="1" scrolling="auto" src="invalid.html"/><frame frameborder="1" scrolling="auto" src="right.html"/></frameset></frameset></html>
{code}

According to the XHTML 1.0 "frameset" DTD and the HTML 4.01 "frameset" DTD, the <frameset> element should NOT be inside of a body tag, which is why you're seeing the odd output.

I believe the issue here is that, because of TagSoup's state machine architecture, the <body> tag has already been emitted by the time it gets to the <frameset>, so it closes the body immediately. TagSoup could hang onto the <body> tag until it sees something other than a <frameset>, but that feels pretty extreme.

Side note - the HTML is slightly broken, in that <frame src=\"invalid.html\"/> is followed by </frame>, but it's already been terminated by the "/>" sequence. Don't know if that was intentional or not.

Also, strictly speaking, <frame> is an empty element in HTML 4.01 and cannot take a closing tag, yet the example closes each one with </frame>. They should be <frame src="blah"> without a </frame>.
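To make the effect on the handler concrete, here is a minimal sketch that replays the TagSoup output quoted above through the JDK's own SAX parser into a handler that keeps a body-level counter. BodyLevelSketch and BodyCounter are hypothetical names, and the filtering is a simplified stand-in for the behaviour described in the report, not Tika's actual HTMLHandler code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class BodyLevelSketch {
    // TagSoup's serialized output, copied from the comment above.
    static final String TAGSOUP_OUTPUT =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<html><head><title> my title </title></head><body/>"
        + "<frameset rows=\"20,*\">"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"top.html\"/>"
        + "<frameset cols=\"20,*\">"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"left.html\"/>"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"invalid.html\"/>"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"right.html\"/>"
        + "</frameset></frameset></html>";

    // Hypothetical handler: keeps only elements that start while inside <body>.
    static class BodyCounter extends DefaultHandler {
        int bodyLevel = 0;
        final List<String> kept = new ArrayList<>();
        final List<String> ignored = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyLevel++;
            } else if (bodyLevel > 0) {
                kept.add(qName);
            } else {
                ignored.add(qName);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyLevel--; // for <body/> this fires before any <frameset> event
            }
        }
    }

    // Parses the given XML and returns the handler so callers can inspect what was kept.
    static BodyCounter replay(String xml) {
        BodyCounter handler = new BodyCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        BodyCounter result = replay(TAGSOUP_OUTPUT);
        System.out.println("kept=" + result.kept + " ignored=" + result.ignored);
    }
}
```

With this input the counter is back at 0 before the first <frameset> arrives, so every frameset and frame ends up in the ignored list and kept stays empty. Moving the frameset inside an explicit <body>...</body> pair in the input flips that result, which matches the behaviour described in the report.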

    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-457:
--------------------------------

    Assignee: Ken Krugler

    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-457:
-----------------------------

    Attachment: TIKA-457.patch

This also improves handling of <frame> elements for [TIKA-463], by resolving relative URLs in src=xxx attributes for these elements.
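For readers unfamiliar with what resolving a relative src means, the sketch below uses the JDK's java.net.URI to resolve frame src values against the URL of the containing page. FrameSrcSketch, resolveSrc and the example.com base URL are hypothetical; the attached patch is not shown in this thread and may resolve URLs differently.

```java
import java.net.URI;

class FrameSrcSketch {
    // Resolves a frame's src attribute against the URL of the page that contains it.
    static String resolveSrc(String baseUrl, String src) {
        return URI.create(baseUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html"; // hypothetical page URL
        System.out.println(resolveSrc(base, "top.html"));   // sibling of index.html
        System.out.println(resolveSrc(base, "/left.html")); // rooted at the host
    }
}
```

A src that is already absolute is returned unchanged by URI.resolve, so the same call can be applied to every frame without checking first.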

    [ https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-457.
------------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

SVN 985288