Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2009/09/27 15:09:16 UTC

[jira] Created: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

HtmlParser calls characters() with post-body data before processing the terminating body element.
-------------------------------------------------------------------------------------------------

                 Key: TIKA-286
                 URL: https://issues.apache.org/jira/browse/TIKA-286
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Ken Krugler
            Priority: Minor


Using this example data:

{noformat}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
       "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
	<meta http-equiv="content-type" content="text/html; charset=utf-8">
	<title>Untitled</title>
	<base href="http://newdomain.com">
</head>
<body>

<a href="link" target="_blank">link1</a>
<a href="http://domain.com/link" target="_blank">link2</a>

</body>
</html>
{noformat}

The handler's characters() method gets called with the following text:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body") is called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns between the </body> and </html> tags, they all get passed to characters() before the endElement("body") call.
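
One way to observe the call order described above is a handler that logs each callback as it arrives. The following is only a minimal sketch: EventRecorder is a hypothetical helper, not part of Tika or nekohtml, and it relies solely on the standard org.xml.sax classes shipped with the JDK.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records SAX callbacks in arrival order
// so the relative position of characters() and endElement() can be inspected.
final class EventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy only the reported slice; the parser may reuse the buffer later.
        String text = new String(ch, start, length);
        // Show newlines as a visible backslash-n so whitespace-only calls stand out.
        events.add("characters(\"" + text.replace("\n", "\\n") + "\")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Parsers that are not namespace-aware may leave localName empty.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        events.add("endElement(\"" + name + "\")");
    }

    List<String> events() {
        return events;
    }
}
```

Passing an instance of this handler to the parser under test and printing events() afterwards shows whether the whitespace-only characters() calls are logged before or after endElement("body").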

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760026#action_12760026 ] 

Uwe Schindler edited comment on TIKA-286 at 9/27/09 6:29 AM:
-------------------------------------------------------------

I think this is a known "bug" (or feature?) in nekohtml which does the HTML parsing!

This is caused by nekohtml trying to "fix" invalid HTML documents (e.g. the nekohtml parser also adds missing element end tags and so on). I think it does this because the html element is not allowed to contain text data, so it tries to fix the document by adding the text to the body element. This is exactly how browsers fix bad HTML as well (e.g. HTML without a body at all and so on). Please note that nekohtml also adds e.g. missing body or html elements (e.g. if the HTML contains only block tags that are normally allowed only inside the body element).

If you do not want this, file a bug report with the nekohtml project.

      was (Author: thetaphi):
    I think this is a known "bug" (or feature?) in nekohtml which does the HTML parsing!
  

[jira] Closed: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-286.
----------------------------

    Resolution: Won't Fix

Thanks for the info, Uwe.

I filed https://sourceforge.net/tracker/?func=detail&aid=2868326&group_id=195122&atid=952178 against CyberNeko. Minor issue, and I can easily fix up my parser comparison code to ignore trailing returns/newlines.
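
For anyone wanting the same workaround, here is a minimal sketch of what ignoring trailing returns/newlines could look like, assuming the text from each characters() call has been collected into a list of strings. TrailingNewlines and dropTrailing are hypothetical names for illustration, not the actual comparison code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: drops trailing chunks that consist only of line breaks,
// so two parsers' characters() output can be compared without the tail whitespace.
final class TrailingNewlines {
    private TrailingNewlines() {
    }

    static List<String> dropTrailing(List<String> chunks) {
        int end = chunks.size();
        // Walk backwards past every chunk made up solely of \n or \r.
        while (end > 0 && isOnlyLineBreaks(chunks.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(chunks.subList(0, end));
    }

    // Returns true for an empty string as well, so empty trailing chunks are dropped too.
    static boolean isOnlyLineBreaks(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }
}
```

Applied to the eight chunks listed in the report, this keeps "Untitled", "\n\n", "link1", "\n", "link2" and drops the last three. Note that it also drops the "\n\n" that legitimately sits inside the body, which is acceptable only when the comparison does not care about trailing whitespace.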



[jira] Commented: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760026#action_12760026 ] 

Uwe Schindler commented on TIKA-286:
------------------------------------

I think this is a known "bug" (or feature?) in nekohtml which does the HTML parsing!
