Posted to dev@nutch.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/05/29 16:01:00 UTC
[jira] [Created] (NUTCH-2589) HTML redirections are not followed when using parse-tika
Gerard Bouchar created NUTCH-2589:
-------------------------------------
Summary: HTML redirections are not followed when using parse-tika
Key: NUTCH-2589
URL: https://issues.apache.org/jira/browse/NUTCH-2589
Project: Nutch
Issue Type: Bug
Environment: nutch-site.xml:
{code:xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-tika</value>
</property>
<property>
<name>http.agent.name</name>
<value>blah</value>
</property>
</configuration>
{code}
fetched url: https://policies.google.com/technologies/ads
Reporter: Gerard Bouchar
Nutch supports HTML redirections declared with meta tags. They work well when parse-html is used to parse files, but they are not detected when parse-tika is used.
This is caused by https://issues.apache.org/jira/browse/TIKA-2652
Tika emits redirection meta tags as:
{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following format:
{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}
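To make the mismatch concrete, here is a minimal, self-contained sketch of a lookup that accepts both attribute forms. It is not the actual HTMLMetaProcessor code and not a proposed patch: the class name MetaRefreshSketch and the method findRefreshUrl are hypothetical, and the sketch assumes the parser output is well-formed XML (so the http-equiv sample is written in self-closed form).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Nutch or Tika.
class MetaRefreshSketch {

  /** Returns the refresh target URL, or null if no refresh meta tag is found. */
  static String findRefreshUrl(String xhtml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      // Assumption: the input is well-formed XML; anything else is rejected.
      throw new IllegalArgumentException("input is not well-formed XML", e);
    }
    NodeList metas = doc.getElementsByTagName("meta");
    for (int i = 0; i < metas.getLength(); i++) {
      Element meta = (Element) metas.item(i);
      // getAttribute returns "" when the attribute is absent, so both checks are safe.
      // Accept the standard http-equiv form and the name form reported for Tika.
      boolean isRefresh =
          "refresh".equalsIgnoreCase(meta.getAttribute("http-equiv"))
              || "refresh".equalsIgnoreCase(meta.getAttribute("name"));
      if (!isRefresh) {
        continue;
      }
      String content = meta.getAttribute("content");
      int idx = content.toLowerCase(Locale.ROOT).indexOf("url=");
      if (idx >= 0) {
        // Everything after "url=" is treated as the redirect target.
        return content.substring(idx + "url=".length()).trim();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String tikaForm =
        "<html><head><meta name=\"refresh\" content=\"0; url=http://example.com\"/></head></html>";
    String expectedForm =
        "<html><head><meta http-equiv=\"refresh\" content=\"0; url=http://example.com\"/></head></html>";
    System.out.println(findRefreshUrl(tikaForm));     // prints http://example.com
    System.out.println(findRefreshUrl(expectedForm)); // prints http://example.com
  }
}
```

A check that looks only at the http-equiv attribute would return null for the first input, which matches the behaviour described in this report.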