Posted to dev@nutch.apache.org by "Andy Liu (JIRA)" <ji...@apache.org> on 2005/05/02 19:55:04 UTC

[jira] Updated: (NUTCH-56) Crawling sites with 403 Forbidden robots.txt

     [ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]

Andy Liu updated NUTCH-56:
--------------------------

    Attachment: robots_403.patch

Added a configuration parameter to allow crawling of sites where a 403 is returned when accessing robots.txt.
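For illustration, such a parameter would typically be set in conf/nutch-site.xml. The property name and default below are assumptions for the sketch, not necessarily what the attached robots_403.patch uses:

```xml
<!-- Hypothetical property name; check robots_403.patch for the actual name. -->
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>If true, a 403 (Forbidden) response when fetching
  robots.txt is treated as if the file does not exist, so the site
  is crawled. If false, a 403 forbids crawling the site entirely.
  </description>
</property>
```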

> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>
>          Key: NUTCH-56
>          URL: http://issues.apache.org/jira/browse/NUTCH-56
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: robots_403.patch
>
> If a 403 error is encountered when trying to access the robots.txt file, Nutch does not crawl any pages from that site. This behavior is consistent with the RFC recommendation for the Robots Exclusion Protocol.
> However, Google does crawl sites that exhibit this behavior, because the webmasters of most such sites are unaware of robots.txt conventions and do want their sites to be crawled.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira