Posted to user@nutch.apache.org by ankit <go...@hotmail.com> on 2015/04/18 08:39:40 UTC

Nutch 1.9 Error 403 Failed Fetch

Hi,

I'm using Nutch 1.9 with Solr 4.9.1. I am trying to extract news articles. Nutch works for some sites, but for others I get a 403 failed fetch. This is the output when I run parsechecker:

  bin/nutch parsechecker -dumpText http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

  fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
  Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
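For reference, one way to check whether the 403 depends on the User-Agent header (rather than on robots.txt) is to request the page with different agent strings and compare the status codes; the agent strings below are just placeholders, not what my crawl actually sends:

  # Compare the status the server returns for a browser-like agent
  # vs. a generic crawler-like agent (both values are placeholders).
  curl -s -o /dev/null -w '%{http_code}\n' \
       -A 'Mozilla/5.0' \
       'http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977'

  curl -s -o /dev/null -w '%{http_code}\n' \
       -A 'MyNutchCrawler/1.0' \
       'http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977'

If the two requests come back with different codes, the server is filtering on the agent string rather than anything in my Nutch configuration.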
With bin/crawl I get:

  fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

The regex filter entry I added for this site:

  +^http://([a-z0-9]*\.)*dnaindia.com

nutch-default.xml has this default value:

  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
    <description>Some servers return HTTP status 403 (Forbidden) if
    /robots.txt doesn't exist. This should probably mean that we are
    allowed to crawl the site nonetheless. If this is set to false,
    then such sites will be treated as forbidden.</description>
  </property>

Am I missing anything? Why am I still getting a failed fetch?

Ankit
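I understand that some sites return 403 when they don't like the crawler's User-Agent, so a related setting I'm also looking at is http.agent.name in conf/nutch-site.xml. A minimal sketch of what I mean (the agent value here is only a placeholder):

  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value; some servers reject requests whose agent
         string is empty or looks like a generic bot. -->
    <value>MyNutchCrawler</value>
  </property>

Is this the right place to look, or is the http.robots.403.allow setting above supposed to cover this case?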