
Posted to dev@tika.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/07/31 13:22:00 UTC

[jira] [Created] (TIKA-2700) The HTML parser should parse the contents of the title tag as raw text, not HTML

Gerard Bouchar created TIKA-2700:
------------------------------------

             Summary: The HTML parser should parse the contents of the title tag as raw text, not HTML
                 Key: TIKA-2700
                 URL: https://issues.apache.org/jira/browse/TIKA-2700
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar
         Attachments: title.html

The current HTML parser in Tika fails to extract the correct document title when the title contains at least one unescaped '<' character.

For instance, in the following HTML document:

{code:html}
<html><title>title with a <b>tag</b> in it</title><body></body></html>
{code}

the extracted title is cut off at the first '<':

{code}
title with a
{code}


Browsers, however, follow the [HTML parsing specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead], which treats the contents of the title element as text rather than markup, and display this title as

{code}
title with a <b>tag</b> in it
{code}

(with a literal _<b>_ shown as text, not interpreted as a tag)
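
The expected behaviour can be illustrated with a minimal, standalone sketch. This is not Tika's implementation and does not use any Tika API; the class name RawTitleSketch and the method extractTitle are hypothetical, and the regular expression is only a rough approximation of the specification's rule that everything between the title start tag and the next title end tag is text. Character references such as &amp;amp; are deliberately left undecoded here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: approximates "title content is text, not markup".
class RawTitleSketch {

    // Matches <title> (optionally with attributes), then lazily captures
    // everything up to the first </title> end tag, case-insensitively.
    private static final Pattern TITLE = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the raw text of the first title element, or null if the
     * input contains no complete title element. Nested "tags" such as
     * <b> are returned verbatim instead of ending the title.
     */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
                "<html><title>title with a <b>tag</b> in it</title><body></body></html>";
        // Prints: title with a <b>tag</b> in it
        System.out.println(extractTitle(html));
    }
}
```

With the document from this report as input, the sketch returns the full string, including the literal _<b>_ and _</b>_, which is what browsers display.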


