Posted to dev@tika.apache.org by "Matthew Caruana Galizia (JIRA)" <ji...@apache.org> on 2016/03/08 15:28:40 UTC

[jira] [Created] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

Matthew Caruana Galizia created TIKA-1896:
---------------------------------------------

             Summary: Invalid closing script tag not handled gracefully by HtmlParser
                 Key: TIKA-1896
                 URL: https://issues.apache.org/jira/browse/TIKA-1896
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.12
            Reporter: Matthew Caruana Galizia
         Attachments: test.html

When an HTML file contains an invalid closing script tag (for example {{</script language>}}, which carries a stray attribute), HtmlParser interprets all content after that tag as script data and therefore ignores it.

A reduced test case file is attached.

To reproduce:

1) Create a file named test.html with the following HTML:

{code:html}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
	<head>
		<script lang="javascript"></script language>
	</head>
	<body>
		<p>This is a test.</p>
	</body>
</html>
{code}

2) Run {{java -jar tika-app-1.12.jar -t test.html}}

Expected result:

{{This is a test.}}

Actual result:

Nothing.
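
For illustration only, here is a minimal sketch of the lenient matching I would expect. It is not Tika's or TagSoup's actual code, and the class and method names ({{ScriptEndTagSketch}}, {{endOfScript}}) are hypothetical. It assumes the HTML5 tokenizer rule that an end tag name is terminated by whitespace, "/" or ">", so {{</script language>}} still closes the script element and the stray attribute is dropped.

```java
// Hypothetical helper, not part of Tika or TagSoup.
class ScriptEndTagSketch {

    // Returns the index just past the end tag that closes script data
    // beginning at 'from', or -1 if no such end tag exists.
    static int endOfScript(String html, int from) {
        final String name = "</script";
        for (int i = html.indexOf("</", from); i >= 0; i = html.indexOf("</", i + 1)) {
            // Tag names are case-insensitive, so compare ignoring case.
            if (!html.regionMatches(true, i, name, 0, name.length())) {
                continue;
            }
            int after = i + name.length();
            if (after >= html.length()) {
                return -1;
            }
            char c = html.charAt(after);
            // Whitespace, '/' or '>' ends the tag name; anything up to the
            // next '>' (such as a stray "language" attribute) is skipped.
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                int close = html.indexOf('>', after);
                return close < 0 ? -1 : close + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script lang=\"javascript\"></script language><p>This is a test.</p>";
        int scriptStart = html.indexOf('>') + 1; // just past the opening <script ...> tag
        int end = endOfScript(html, scriptStart);
        System.out.println(html.substring(end));
    }
}
```

With matching along these lines, the content after the malformed end tag would be handed back to the normal HTML handling instead of being swallowed as script data.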