Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2009/09/27 15:27:16 UTC

[jira] Commented: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

    [ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760026#action_12760026 ] 

Uwe Schindler commented on TIKA-286:
------------------------------------

I think this is a known "bug" (or feature?) in NekoHTML, the library that does the HTML parsing.

> HtmlParser calls characters() with post-body data before processing the terminating body element.
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-286
>                 URL: https://issues.apache.org/jira/browse/TIKA-286
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Priority: Minor
>
> Using this example data:
> {noformat}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>        "http://www.w3.org/TR/html4/strict.dtd">
> <html lang="en">
> <head>
> 	<meta http-equiv="content-type" content="text/html; charset=utf-8">
> 	<title>Untitled</title>
> 	<base href="http://newdomain.com">
> </head>
> <body>
> <a href="link" target="_blank">link1</a>
> <a href="http://domain.com/link" target="_blank">link2</a>
> </body>
> </html>
> {noformat}
> The handler's characters() method gets called with the following text:
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
> The first six calls make sense to me.
> The last two calls (each with a single \n) happen just before endElement("body") is called, which is unexpected.
> Judging from the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns between </body> and </html>, they all get passed to characters() before the endElement("body") call.
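
For anyone who wants to check the callback order themselves, here is a minimal sketch of a handler that records SAX events in the order they arrive. The names EventOrderDemo, RecordingHandler and record() are hypothetical and not part of Tika. To stay self-contained, the sketch feeds a well-formed XML variant of the sample through the JDK's built-in SAX parser, not through Tika's HtmlParser or NekoHTML, so it should only show the order one would normally expect (whitespace following </body> reported after endElement("body")), not the behaviour reported in this issue.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventOrderDemo {

    /** Hypothetical helper: records SAX callbacks in arrival order. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement(" + qName + ")");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement(" + qName + ")");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Escape newlines so whitespace-only calls are visible in the log.
            String text = new String(ch, start, length).replace("\n", "\\n");
            events.add("characters(" + text + ")");
        }
    }

    /** Parses the given XML with the JDK SAX parser (an assumption: not NekoHTML). */
    static List<String> record(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // Simplified, well-formed stand-in for the sample document above (no DOCTYPE).
        String xml = "<html><head><title>Untitled</title></head><body>\n"
                + "<a href=\"link\">link1</a>\n"
                + "</body>\n</html>";
        for (String event : record(xml)) {
            System.out.println(event);
        }
    }
}
```

Since RecordingHandler is an ordinary org.xml.sax.ContentHandler, the same handler could presumably be handed to HtmlParser in place of the JDK parser to compare the two event sequences side by side.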
