Posted to dev@tika.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/05/25 10:16:00 UTC

[jira] [Updated] (TIKA-2652) HtmlParser generates incorrect meta tags

     [ https://issues.apache.org/jira/browse/TIKA-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerard Bouchar updated TIKA-2652:
---------------------------------
    Description: 
Whatever attributes the input HTML meta tags carry, Tika's output meta tags can only have a "name" and a "content" attribute. This produces invalid HTML meta tags in the output.

For instance, the following valid HTML file

{code:html}
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Title</title>
    <meta http-equiv="refresh" content="0; url=http://example.com">
  </head>
  <body></body>
</html>
{code}

is transformed into a SAX stream corresponding to the following HTML:

{code:html}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dc:title" content="Title"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="refresh" content="0; url=http://example.com"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Title</title>
</head>
<body/></html>
{code}

(The redirection, content type, and content encoding are all specified in a non-standard way.)

The information that the original file had an "http-equiv" meta tag is lost: the tag is replaced by a generic "meta name=" tag.

This causes problems for classes that expect a standard meta redirection, such as Nutch's [HTMLMetaProcessor|https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java].
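
To illustrate why the lost attribute matters, here is a minimal Python sketch of a downstream consumer that only recognizes a redirect declared through "http-equiv". It is neither Tika nor Nutch code: the names MetaRefreshFinder and find_meta_refresh are hypothetical, and the strict "http-equiv only" rule is an assumption about how such a consumer might behave. The two documents are the ones quoted in this report.

```python
from html.parser import HTMLParser

# The valid input file from this report.
ORIGINAL_HTML = """<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Title</title>
    <meta http-equiv="refresh" content="0; url=http://example.com">
  </head>
  <body></body>
</html>
"""

# The HTML corresponding to the SAX stream described in this report.
TIKA_OUTPUT = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dc:title" content="Title"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="refresh" content="0; url=http://example.com"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Title</title>
</head>
<body/></html>
"""


class MetaRefreshFinder(HTMLParser):
    """Hypothetical consumer: collects the content of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.refresh_targets = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also reach this method.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Assumption: only the standard form counts, i.e. http-equiv="refresh".
        # A tag written as name="refresh" is treated as ordinary metadata.
        if (attributes.get("http-equiv") or "").lower() == "refresh":
            self.refresh_targets.append(attributes.get("content"))


def find_meta_refresh(html):
    """Return the content values of all http-equiv refresh meta tags in html."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.refresh_targets


print(find_meta_refresh(ORIGINAL_HTML))
print(find_meta_refresh(TIKA_OUTPUT))
```

Under these assumptions the finder sees the redirection in the original file but not in the rewritten output, because the only remaining hint is the non-standard name="refresh" tag.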

  was:
Whatever the input HTML meta are, tika's HTML meta can only have a "name" and a "content"  attribute. This gives invalid HTML meta tags in the output.

For instance, the following valid HTML file

{code:html}
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Title</title>
    <meta http-equiv="refresh" content="0; url=http://example.com">
  </head>
  <body></body>
</html>
{code}

is transformed into a SAX stream corresponding to the following HTML :

{code:html}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dc:title" content="Title"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="refresh" content="0; url=http://example.com"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Title</title>
</head>
<body/></html>
{code}

The information that the original file had an "http-equiv" meta tag is lost, and replaced by a generic "meta name=" tag.

This is annoying when working with classes expecting valid meta redirection, such as Nutch's [HTMLMetaProcessot|https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java], for instance.


> HtmlParser generates incorrect meta tags
> ----------------------------------------
>
>                 Key: TIKA-2652
>                 URL: https://issues.apache.org/jira/browse/TIKA-2652
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Whatever attributes the input HTML meta tags carry, Tika's output meta tags can only have a "name" and a "content" attribute. This produces invalid HTML meta tags in the output.
> For instance, the following valid HTML file
> {code:html}
> <!DOCTYPE html>
> <html lang="en">
>   <head>
>     <title>Title</title>
>     <meta http-equiv="refresh" content="0; url=http://example.com">
>   </head>
>   <body></body>
> </html>
> {code}
> is transformed into a SAX stream corresponding to the following HTML:
> {code:html}
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="dc:title" content="Title"/>
> <meta name="Content-Encoding" content="ISO-8859-1"/>
> <meta name="refresh" content="0; url=http://example.com"/>
> <meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title>Title</title>
> </head>
> <body/></html>
> {code}
> (The redirection, content type, and content encoding are all specified in a non-standard way.)
> The information that the original file had an "http-equiv" meta tag is lost: the tag is replaced by a generic "meta name=" tag.
> This causes problems for classes that expect a standard meta redirection, such as Nutch's [HTMLMetaProcessor|https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)