Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2009/04/03 19:56:02 UTC

Re: robots.txt redirect (NUTCH-124)

Hi Mathijs,

I've posted a patch for this on
https://issues.apache.org/jira/browse/NUTCH-731

HTH

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/3/17 Mathijs Homminga <ma...@gmail.com>

> Hi everybody,
>
> Can someone shed some light on NUTCH-124?
> RobotRulesParser.java doesn't follow redirects when requesting the
> robots.txt file. Doug patched this, but the patch didn't make it into trunk.
> What is the desired behavior here?
>
>
> For example, when requesting the following url:
> http://7is7.com/software/stateye/download/stateye097f.html
>
> ... RobotRulesParser requests the following robots.txt:
> http://7is7.com/robots.txt
>
> ... however, that file doesn't exist; the request redirects to:
> http://www.7is7.com/robots.txt
>
> ... and that robots.txt tells us the initial url is disallowed.
> But does it really? Or is that robots.txt file only applicable to
> http://www.7is7.com and not to http://7is7.com?
>
> So the question is: should we follow such redirects?
>
> Thanks,
> Mathijs
>
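
The scenario in the quoted message can be sketched in a few lines of Java.
This is only an illustration, not the code from NUTCH-124 or NUTCH-731: the
class name, the method names, the redirect limit of 5, and the in-memory
redirect table that stands in for real HTTP requests are all assumptions
made for the example. It derives the robots.txt URL from a page URL, follows
a bounded number of redirects, and reports whether the rules ended up being
served from a different host than the one originally asked.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class RobotsRedirectSketch {

    /** Builds the robots.txt location for the host that serves the given page. */
    public static URI robotsUrlFor(URI page) {
        // Keep scheme and authority (host[:port]); drop path, query and fragment.
        return URI.create(page.getScheme() + "://" + page.getAuthority() + "/robots.txt");
    }

    /**
     * Follows redirects for at most maxRedirects hops and returns the last URL reached.
     * redirectOf is a hypothetical stand-in for an HTTP request: it returns the
     * redirect target when the server answers with a 3xx, or empty otherwise.
     */
    public static URI resolve(URI start, Function<URI, Optional<URI>> redirectOf, int maxRedirects) {
        URI current = start;
        for (int hops = 0; hops < maxRedirects; hops++) {
            Optional<URI> next = redirectOf.apply(current);
            if (!next.isPresent()) {
                return current; // no further redirect: this is where the file is served
            }
            // Resolve against the current URL so relative targets also work.
            current = current.resolve(next.get());
        }
        return current; // hop limit reached; caller decides how to treat this
    }

    public static void main(String[] args) {
        // Assumed redirect table mirroring the example in the message above.
        Map<URI, URI> redirects = new HashMap<>();
        redirects.put(URI.create("http://7is7.com/robots.txt"),
                      URI.create("http://www.7is7.com/robots.txt"));

        URI page = URI.create("http://7is7.com/software/stateye/download/stateye097f.html");
        URI requested = robotsUrlFor(page);
        URI served = resolve(requested, u -> Optional.ofNullable(redirects.get(u)), 5);

        System.out.println(requested + " -> " + served);
        System.out.println("same host: " + requested.getHost().equals(served.getHost()));
    }
}
```

For the URLs above, the requested host (7is7.com) and the serving host
(www.7is7.com) differ. The sketch makes that difference visible but does not
decide the open question of whether rules fetched via such a redirect should
be applied to the original host.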