Posted to dev@nutch.apache.org by "Santiago M. Mola" <co...@gmail.com> on 2013/08/19 18:58:56 UTC

nofollow behaviour [#NUTCH-693]

Hi,

I've been experimenting with Nutch for crawling Tor hidden services, and
one annoying issue still forces me to run a patched Nutch version:
#NUTCH-693 [1].

This issue is a request for an option to control the behaviour of Nutch
when it encounters a rel="nofollow" link. Currently, Nutch always ignores
such links and there is no way of configuring this behaviour without
patching it.
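
To make the request concrete, here is a minimal sketch of what such an
option amounts to. It is written in Python for brevity and is not Nutch
code; the names LinkExtractor, extract_links and follow_nofollow are made
up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, optionally skipping rel="nofollow"."""

    def __init__(self, follow_nofollow=False):
        super().__init__()
        # Hypothetical option: False mimics the "always ignore" behaviour,
        # True keeps nofollow links as outlinks.
        self.follow_nofollow = follow_nofollow
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="nofollow noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and not self.follow_nofollow:
            return
        self.links.append(href)


def extract_links(html, follow_nofollow=False):
    """Return the outlinks found in html, honouring the follow_nofollow flag."""
    parser = LinkExtractor(follow_nofollow=follow_nofollow)
    parser.feed(html)
    parser.close()
    return parser.links
```

With the flag off, a page whose external links are all tagged nofollow
yields no outlinks at all; with it on, they are collected like any other
link.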

The issue was closed with little discussion, on the grounds that such an
option would be the same as a hypothetical "ignore.robotstxt" option. This
is not the case. robots.txt is the way for webmasters to prevent crawlers
from accessing certain URLs. This is *not* the job of nofollow. robots.txt
is always controlled by the webmaster and, as such, it makes sense to
strictly honour it. On the other hand, nofollow is always set by a third
party, the site that links to a URL rather than the site that hosts it
(otherwise, robots.txt should be used), and its well-established use is to
signal non-endorsement of a URL. In practice, that means not passing
link juice to potential spammers.

nofollow is not meant to be an access control mechanism. nofollow is not
meant to protect websites from crawler abuse either. That is robots.txt's
job. So there is no point in treating them as the same.
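
For contrast, this is roughly what the robots.txt side looks like: a
per-URL fetch permission published by the target site itself. The sketch
below uses Python's standard urllib.robotparser; the helper name
allowed_by_robots and the example rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

This check answers "may I fetch this URL?" on behalf of the site that
hosts it. A rel="nofollow" attribute answers no such question; it only
annotates a link on someone else's page.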

Now, there are very real use cases for following links with the
rel="nofollow" attribute. In a loosely connected portion of the web,
following these links might be the only sane way to crawl successfully.

The Tor deepweb is a very clear case. There is a site which is very central
in the Tor link-graph: The Hidden Wiki. It is a great seed for crawling
Tor. But it's MediaWiki-based. And that means that every external link is
tagged as rel="nofollow". Finding enough good seed URLs to crawl Tor
without going through rel="nofollow" links is not trivial at all.

The same might happen when crawling corporate intranets, I2P or other
networks.

So there is a clear use case for adding an option for following
rel="nofollow" links. And, as far as I know, there is no good reason not
to add it. That is why I would like this to be discussed and, if deemed
sensible, #NUTCH-693 reopened.

[1] https://issues.apache.org/jira/browse/NUTCH-693

Best,
-- 
Santiago M. Mola
Jabber ID: cooldwind@gmail.com