Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2009/09/27 21:32:16 UTC

[jira] Closed: (TIKA-286) HtmlParser calls characters() with post-body data before processing the terminating body element.

     [ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-286.
----------------------------

    Resolution: Won't Fix

Thanks for the info, Uwe.

I filed https://sourceforge.net/tracker/?func=detail&aid=2868326&group_id=195122&atid=952178 against CyberNeko. This is a minor issue, and I can easily fix up my parser comparison code to ignore trailing returns/newlines.
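
For anyone comparing parser output the same way, here is a minimal sketch of what "ignore trailing returns/newlines" could look like. The class and method names (TrailingNewlines, stripTrailing, sameIgnoringTrailingNewlines) are hypothetical and are not taken from my actual comparison code or from Tika; the sketch assumes the extracted text from each parser has already been collected into a String.

```java
// Hypothetical helper, not part of Tika or CyberNeko.
final class TrailingNewlines {
    private TrailingNewlines() {}

    /** Returns the text with any run of '\r' and '\n' characters removed from its end. */
    static String stripTrailing(String text) {
        int end = text.length();
        while (end > 0) {
            char c = text.charAt(end - 1);
            if (c != '\n' && c != '\r') {
                break;
            }
            end--;
        }
        return text.substring(0, end);
    }

    /** Compares two extracted texts, ignoring differences in trailing returns/newlines only. */
    static boolean sameIgnoringTrailingNewlines(String a, String b) {
        return stripTrailing(a).equals(stripTrailing(b));
    }
}
```

Only the tail of each string is normalized, so differences in interior whitespace still count as mismatches.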


> HtmlParser calls characters() with post-body data before processing the terminating body element.
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-286
>                 URL: https://issues.apache.org/jira/browse/TIKA-286
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Priority: Minor
>
> Using this example data:
> {noformat}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>        "http://www.w3.org/TR/html4/strict.dtd">
> <html lang="en">
> <head>
> 	<meta http-equiv="content-type" content="text/html; charset=utf-8">
> 	<title>Untitled</title>
> 	<base href="http://newdomain.com">
> </head>
> <body>
> <a href="link" target="_blank">link1</a>
> <a href="http://domain.com/link" target="_blank">link2</a>
> </body>
> </html>
> {noformat}
> The handler's characters() method gets called with the following text, one call per line:
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
> The first six calls make sense to me.
> The last two calls (with a single \n) happen just before endElement("body") is called, and this is unexpected.
> From the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns between the </body> and </html> tags, they all get passed to characters() before the endElement("body") call.
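
The event ordering described in the report can be observed with a small recording handler. The following is only a sketch: EventRecorder and record() are hypothetical names, and it drives the handler with the JDK's own XML SAX parser on well-formed input rather than with Tika's HtmlParser, so it shows the ordering one would normally expect (whitespace after the closing body tag arrives after endElement("body")), not the behavior reported here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events in the order they are delivered.
final class EventRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape newlines so each recorded event stays readable on one line.
        String text = new String(ch, start, length);
        events.add("chars:" + text.replace("\r", "\\r").replace("\n", "\\n"));
    }

    /** Parses well-formed XML with the JDK SAX parser and returns the recorded events. */
    static List<String> record(String xml) {
        EventRecorder recorder = new EventRecorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return recorder.events;
    }
}
```

Calling EventRecorder.record("<html><body>x</body>\n</html>") should list the end:body event before the chars event for the newline. Passing an instance of the same handler to an HTML parser instead would make the difference in ordering visible, although the exact parse call depends on the parser and its version.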
