Posted to dev@tika.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/08/17 12:16:00 UTC

[jira] [Created] (TIKA-2709) Invalid handling of <base> tags

Gerard Bouchar created TIKA-2709:
------------------------------------

             Summary: Invalid handling of <base> tags
                 Key: TIKA-2709
                 URL: https://issues.apache.org/jira/browse/TIKA-2709
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar


Currently, when the HTML parser encounters the following:

{code:html}
<base href='http://example.com/'>
{code}

it emits SAX events corresponding to the following:

{code:html}
<base />
<meta name='Content-Location' value='http://example.com/' />
{code}

Note that the emitted "base" element has no attributes, which is [not valid in HTML|https://html.spec.whatwg.org/multipage/semantics.html#the-base-element].

Moreover, the [Content-Location HTTP header|https://tools.ietf.org/html/rfc7231#section-3.1.4.2] has a different meaning from a document base URL, and Tika's behavior does not allow application code to distinguish between a base tag and an http-equiv meta tag that sets Content-Location.
 
See: https://github.com/apache/tika/blob/18f4e24451b1d835ab1897f49389788f78063a52/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java#L158-L164
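
To illustrate what a downstream consumer sees, here is a minimal sketch that does not use Tika at all. It feeds the event sequence quoted above (wrapped in a "head" element) through the JDK's own SAX parser and records what a handler can observe. The class names BaseTagEvents and Recorder are hypothetical, and using the "value" attribute on the meta element simply mirrors the events shown in this report.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: replays the reported events
// through the JDK SAX parser to show what application code can observe.
public class BaseTagEvents {

    /** Collects the base hrefs and Content-Location values seen in the event stream. */
    static class Recorder extends DefaultHandler {
        // One entry per <base> element; the entry is null when href is absent.
        final List<String> baseHrefs = new ArrayList<>();
        // One entry per <meta name='Content-Location'> element.
        final List<String> contentLocations = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("base".equals(qName)) {
                baseHrefs.add(atts.getValue("href"));
            } else if ("meta".equals(qName)
                    && "Content-Location".equals(atts.getValue("name"))) {
                contentLocations.add(atts.getValue("value"));
            }
        }
    }

    /** Parses the given XML fragment and returns the recorder that observed it. */
    static Recorder record(String xml) {
        Recorder recorder = new Recorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample events", e);
        }
        return recorder;
    }

    public static void main(String[] args) {
        // The events described in this report, serialized as XML.
        String reported =
                "<head><base /><meta name='Content-Location' value='http://example.com/' /></head>";
        Recorder r = record(reported);
        System.out.println("base hrefs: " + r.baseHrefs);
        System.out.println("Content-Location: " + r.contentLocations);
    }
}
```

With the reported events, the handler sees a base element whose href is null, and the URL only arrives through the meta element, so nothing in the stream tells the application that the URL came from a base tag rather than from a Content-Location declaration.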


