Posted to dev@nutch.apache.org by "Sriram Nookala (JIRA)" <ji...@apache.org> on 2017/03/09 15:01:38 UTC
[jira] [Created] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled
Sriram Nookala created NUTCH-2365:
-------------------------------------
Summary: HTTP Redirects to SubDomains don't get crawled
Key: NUTCH-2365
URL: https://issues.apache.org/jira/browse/NUTCH-2365
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.12
Environment: Fedora 25
Reporter: Sriram Nookala
When crawling http://www.mercenarytrader.com, which redirects to https://members.mercenarytrader.com, Nutch does not follow the redirect, even though 'db.ignore.external.links' is set to 'true' and 'db.ignore.external.links.mode' is set to 'byDomain'.
The bug is in FetcherThread, where the redirect check compares hosts rather than domains, regardless of the configured mode:
    String origHost = new URL(urlString).getHost().toLowerCase();
    String newHost = new URL(newUrl).getHost().toLowerCase();
    if (ignoreExternalLinks) {
      if (!origHost.equals(newHost)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(" - ignoring redirect " + redirType + " from "
              + urlString + " to " + newUrl
              + " because external links are ignored");
        }
        return null;
      }
    }
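
To illustrate the difference between the two comparisons, here is a minimal standalone sketch, not the actual Nutch code or a proposed patch. The class RedirectDomainCheck, the helpers host() and naiveDomain(), and the byDomain flag are all hypothetical names made up for this example. The domain is approximated by keeping the last two labels of the host, which is an assumption that breaks for suffixes such as "co.uk"; a real fix would need a public-suffix-aware lookup (Nutch's URLUtil class is the likely place to look). java.net.URI is used instead of java.net.URL only to keep the sketch free of checked exceptions.

```java
import java.net.URI;
import java.util.Locale;

class RedirectDomainCheck {

    // Extracts the lower-cased host from an absolute URL string.
    static String host(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: approximates the registered domain by keeping
    // the last two labels of the host. This is an assumption for the
    // example only; it is wrong for multi-label suffixes such as "co.uk".
    static String naiveDomain(String url) {
        String h = host(url);
        String[] labels = h.split("\\.");
        if (labels.length <= 2) {
            return h;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Returns true if the redirect target counts as external.
    // byDomain = false mirrors the host comparison quoted above;
    // byDomain = true compares the (approximated) domains instead.
    static boolean isExternalRedirect(String fromUrl, String toUrl,
                                      boolean byDomain) {
        if (byDomain) {
            return !naiveDomain(fromUrl).equals(naiveDomain(toUrl));
        }
        return !host(fromUrl).equals(host(toUrl));
    }

    public static void main(String[] args) {
        String from = "http://www.mercenarytrader.com";
        String to = "https://members.mercenarytrader.com";
        // Host comparison: the redirect is treated as external and dropped.
        System.out.println(isExternalRedirect(from, to, false)); // true
        // Domain comparison: both share mercenarytrader.com, so it is kept.
        System.out.println(isExternalRedirect(from, to, true));  // false
    }
}
```

With the host comparison the redirect from www to members is rejected, which matches the behaviour reported here; with the domain comparison it would be followed.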
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)