Posted to user@nutch.apache.org by Vijay Krishnan <vi...@gmail.com> on 2008/05/24 02:15:17 UTC

Ignoring robots.txt

Hi all,

    I wish to crawl a certain set of URLs to depth 1 (without any
deeper crawling) for further analysis. I find that Nutch does not
crawl URLs that are disallowed by robots.txt. Is there some way to
stop Nutch from checking robots.txt? That would make my job much
easier than saving the web pages some other way and then passing
them through Nutch.
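
For context, the depth-1 crawl I have in mind is roughly the stock
crawl command below; the "urls" seed directory and the "crawl" output
directory are just placeholder names from my setup:

    bin/nutch crawl urls -dir crawl -depth 1

So the only thing blocking some of the seed URLs is the robots.txt
check during the fetch step.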


Thanks
-- 
Vijay Krishnan
http://www.cs.stanford.edu/~vijayk