Posted to dev@nutch.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2005/11/10 04:30:05 UTC

[jira] Commented: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

    [ http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ] 

Fuad Efendi commented on NUTCH-124:
-----------------------------------

Is such behavior defined in the Robots Exclusion Protocol (http://www.robotstxt.org/)? If so, it should be some kind of new field in the robots.txt of the source site, such as:
Redirect-Disallow: Nutch

Compare this with how Nutch behaves when one site links to a page on a second site and the second site has a "Disallow" rule for that page. Nutch handles this correctly: it uses the robots.txt file from the same site as the web page.

Robots.txt MUST NOT define behavior for foreign sites.
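
To make the same-site rule concrete, here is a minimal sketch in Java. It is not Nutch's actual code: the class name RobotsLocation and the method robotsUrlFor are hypothetical, and the host names are placeholders. It only shows that the robots.txt consulted for a page is derived from that page's own scheme, host and port, never from the site that linked to it.

```java
import java.net.URI;

class RobotsLocation {
    // Hypothetical helper (an assumption, not part of Nutch): returns the
    // robots.txt URL on the same site as the given page.
    static String robotsUrlFor(String pageUrl) {
        URI page = URI.create(pageUrl);
        // Keep an explicit port if the page URL has one; getPort() is -1 otherwise.
        String port = page.getPort() == -1 ? "" : ":" + page.getPort();
        // Path, query and fragment of the page are ignored on purpose.
        return page.getScheme() + "://" + page.getHost() + port + "/robots.txt";
    }

    public static void main(String[] args) {
        // A page on site B that was discovered through a link on site A
        // is still governed by site B's robots.txt.
        System.out.println(robotsUrlFor("http://site-b.example/docs/page.html"));
    }
}
```

Under this sketch, a link from one site to another never changes which robots.txt applies to the target page; whether a redirected robots.txt should be followed is a separate question that the helper does not address.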


> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
>          Key: NUTCH-124
>          URL: http://issues.apache.org/jira/browse/NUTCH-124
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>     Reporter: Doug Cutting
>      Fix For: 0.8-dev
>
> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site.  See http://www.webmasterworld.com/forum11/3008.htm.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira