Posted to dev@tika.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/07/31 13:22:00 UTC
[jira] [Created] (TIKA-2700) The HTML parser should parse the contents of the title tag as raw text, not HTML
Gerard Bouchar created TIKA-2700:
------------------------------------
Summary: The HTML parser should parse the contents of the title tag as raw text, not HTML
Key: TIKA-2700
URL: https://issues.apache.org/jira/browse/TIKA-2700
Project: Tika
Issue Type: Bug
Reporter: Gerard Bouchar
Attachments: title.html
The current HTML parser in Tika fails to extract the correct document title when the title contains at least one unescaped '<' character.
For instance, in the following HTML document:
{code:html}
<html><title>title with a <b>tag</b> in it</title><body></body></html>
{code}
the extracted title is
{code}
title with a
{code}
Browsers, however, follow the [HTML parsing specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead] and display this title as
{code}
title with a <b>tag</b> in it
{code}
(with a literal _<b>_)
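
To illustrate the expected behavior, here is a minimal Python sketch of RCDATA-style title extraction: everything between the title start tag and the next title end tag is taken as text, so tags are not recognized, while character references are still decoded. The function name extract_title_rcdata is hypothetical and is not part of Tika. The regular expression only approximates the specification's tokenizer rules, so this is an illustration of the idea rather than a proposed fix.

```python
import html
import re

# Hypothetical helper for illustration only; not Tika code.
# Matches the first <title ...> start tag and the nearest </title> end tag,
# capturing whatever lies between them without interpreting it as markup.
_TITLE_RE = re.compile(
    r"<title\b[^>]*>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title_rcdata(document):
    """Return the title content as raw text, or None if there is no title."""
    match = _TITLE_RE.search(document)
    if match is None:
        return None
    # In RCDATA content, '<b>' is ordinary text, but character references
    # such as '&amp;' are still decoded.
    return html.unescape(match.group(1))


if __name__ == "__main__":
    doc = "<html><title>title with a <b>tag</b> in it</title><body></body></html>"
    print(extract_title_rcdata(doc))
```

Run on the document from this report, the sketch returns the full title text, including the literal <b> and </b>, instead of stopping at the first '<'.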
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)