Posted to dev@tika.apache.org by "Matthew Caruana Galizia (JIRA)" <ji...@apache.org> on 2016/03/08 15:28:40 UTC
[jira] [Created] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser
Matthew Caruana Galizia created TIKA-1896:
---------------------------------------------
Summary: Invalid closing script tag not handled gracefully by HtmlParser
Key: TIKA-1896
URL: https://issues.apache.org/jira/browse/TIKA-1896
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.12
Reporter: Matthew Caruana Galizia
Attachments: test.html
When an HTML file contains an invalid closing script tag, such as {{</script language>}}, HtmlParser interprets all content after that tag as script data and therefore ignores it.
Reduced test case file attached.
To reproduce:
1) Create a file named test.html with the following HTML:
{code:html}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<script lang="javascript"></script language>
</head>
<body>
<p>This is a test.</p>
</body>
</html>
{code}
2) Run {{java -jar tika-app-1.12.jar -t test.html}}
Expected result:
{{This is a test.}}
What is actually returned:
Nothing. The output is empty.
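Until the parser handles this case, one possible workaround is to normalize malformed closing script tags before handing the HTML to Tika. The following is only a minimal sketch under that assumption: the class name {{ScriptTagNormalizer}} and its {{normalize}} method are hypothetical and not part of Tika, and a regex-based rewrite like this has not been checked against every malformed input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rewrites closing script tags that
// carry extra content (e.g. "</script language>") to a plain "</script>".
public class ScriptTagNormalizer {

    // "</script" followed by whitespace and anything up to the next '>'.
    // A well-formed "</script>" has no whitespace and is left untouched.
    private static final Pattern MALFORMED_CLOSE =
            Pattern.compile("</script\\s+[^>]*>", Pattern.CASE_INSENSITIVE);

    public static String normalize(String html) {
        return MALFORMED_CLOSE.matcher(html).replaceAll("</script>");
    }

    public static void main(String[] args) {
        // Head section of the reduced test case from this report.
        String html = "<script lang=\"javascript\"></script language>\n"
                + "<p>This is a test.</p>";
        System.out.println(normalize(html));
    }
}
```

The normalized string could then be passed to the parser in place of the original file content. This only sidesteps the problem on the caller's side; it does not change how HtmlParser treats the invalid tag.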