Posted to dev@nutch.apache.org by "Sriram Nookala (JIRA)" <ji...@apache.org> on 2017/03/09 15:01:38 UTC

[jira] [Created] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled

Sriram Nookala created NUTCH-2365:
-------------------------------------

             Summary: HTTP Redirects to SubDomains don't get crawled
                 Key: NUTCH-2365
                 URL: https://issues.apache.org/jira/browse/NUTCH-2365
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.12
         Environment: Fedora 25
            Reporter: Sriram Nookala


Crawling http://www.mercenarytrader.com, which redirects to https://members.mercenarytrader.com, fails: Nutch does not follow the redirect even though 'db.ignore.external.links' is set to 'true' and 'db.ignore.external.links.mode' is set to 'byDomain'.
The bug is in FetcherThread, where the original and redirect URLs are compared by host and not by domain:

      String origHost = new URL(urlString).getHost().toLowerCase();
      String newHost = new URL(newUrl).getHost().toLowerCase();
      if (ignoreExternalLinks) {
        if (!origHost.equals(newHost)) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(" - ignoring redirect " + redirType + " from "
                + urlString + " to " + newUrl
                + " because external links are ignored");
          }
          return null;
        }
      }
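
To illustrate the difference between the two comparisons, here is a minimal, self-contained sketch. It is not the Nutch code and not a proposed patch: the class RedirectDomainCheck and its methods are hypothetical names, and the domain extraction simply keeps the last two labels of the host name, which is an assumption that breaks for suffixes such as co.uk. Inside Nutch, a helper such as URLUtil.getDomainName would presumably be the better choice.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not part of Nutch.
class RedirectDomainCheck {

  // Naive assumption: the "domain" is the last two labels of the host.
  // This is wrong for multi-label public suffixes (e.g. example.co.uk)
  // and for IP addresses.
  static String naiveDomain(String host) {
    String lower = host.toLowerCase();
    String[] labels = lower.split("\\.");
    int n = labels.length;
    if (n <= 2) {
      return lower;
    }
    return labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true if the redirect target counts as external under the
  // given mode. "byDomain" compares naive domains; any other value
  // falls back to comparing full host names, as the quoted code does.
  static boolean isExternalRedirect(String fromUrl, String toUrl, String mode)
      throws MalformedURLException {
    String origHost = new URL(fromUrl).getHost().toLowerCase();
    String newHost = new URL(toUrl).getHost().toLowerCase();
    if ("byDomain".equals(mode)) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }
}
```

With the URLs from this report, the host comparison treats members.mercenarytrader.com as external to www.mercenarytrader.com, while the domain comparison treats both as mercenarytrader.com and would let the redirect through.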



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)