Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2005/11/05 04:39:19 UTC
[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
protocol-httpclient does not follow redirects when fetching robots.txt
----------------------------------------------------------------------
Key: NUTCH-124
URL: http://issues.apache.org/jira/browse/NUTCH-124
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev, 0.7.2-dev
Reporter: Doug Cutting
If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site. See http://www.webmasterworld.com/forum11/3008.htm.
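To illustrate the behavior the report asks for, here is a minimal Python sketch (not Nutch code) that follows redirects when fetching robots.txt and gives up after a fixed number of hops. The `fetch` callable, the `MAX_REDIRECTS` value, and the example.com URLs are assumptions made for this illustration only.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # assumed limit for this sketch, not taken from Nutch
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_robots(robots_url, fetch, max_redirects=MAX_REDIRECTS):
    """Return the robots.txt body, following redirects up to a limit.

    `fetch(url)` is a hypothetical callable returning (status, headers, body).
    Returns None when no usable robots.txt is found.
    """
    url = robots_url
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES and "Location" in headers:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, headers["Location"])
            continue
        return body if status == 200 else None
    return None  # too many redirects


# Simulated responses, so the sketch runs without network access.
_PAGES = {
    "http://example.com/robots.txt":
        (301, {"Location": "http://www.example.com/robots.txt"}, ""),
    "http://www.example.com/robots.txt":
        (200, {}, "User-agent: *\nDisallow: /private/\n"),
}


def fake_fetch(url):
    return _PAGES.get(url, (404, {}, ""))


robots_body = fetch_robots("http://example.com/robots.txt", fake_fetch)
```

Without the redirect handling, the first response (a 301 with an empty body) would be treated as "no rules", which is the failure described above.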
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ]
Fuad Efendi commented on NUTCH-124:
-----------------------------------
Is such behavior defined in the Robots Exclusion Protocol (http://www.robotstxt.org/)? If so, it should be some kind of new field in the robots.txt of the source site, such as:
Redirect-Disallow: Nutch
Compare this with how Nutch behaves when one site links to a page on a second site, and the second site has a "Disallow" for that page. Nutch handles this correctly: it uses the robots.txt file from the same site as the web page.
Robots.txt MUST NOT define behavior for foreign sites.
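One way to read the point above is that robots rules are looked up by the host of the page being fetched, never by the host of the linking page. The following Python sketch shows that lookup using the standard library's `urllib.robotparser`; the helper names, the `host_rules` dictionary, and the a.example / b.example hosts are hypothetical and are not part of Nutch.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def add_rules(host_rules, host, robots_body):
    """Parse a robots.txt body and store the rules under the host it governs."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    host_rules[host] = parser


def allowed(host_rules, agent, url):
    """Check a URL against the rules of its own host only."""
    parser = host_rules.get(urlsplit(url).netloc)
    # In this sketch, a host with no recorded rules allows everything.
    return True if parser is None else parser.can_fetch(agent, url)


host_rules = {}
add_rules(host_rules, "a.example", "User-agent: *\nDisallow: /private/\n")

blocked_on_a = allowed(host_rules, "Nutch", "http://a.example/private/page.html")
open_on_a = allowed(host_rules, "Nutch", "http://a.example/public/page.html")
open_on_b = allowed(host_rules, "Nutch", "http://b.example/private/page.html")
```

The rules recorded for a.example have no effect on b.example, even though the paths are the same.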
> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
> Key: NUTCH-124
> URL: http://issues.apache.org/jira/browse/NUTCH-124
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev, 0.7.2-dev
> Reporter: Doug Cutting
> Fix For: 0.8-dev
>
> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site. See http://www.webmasterworld.com/forum11/3008.htm.
Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
Posted by Doug Cutting <cu...@nutch.org>.
Massimo Miccoli wrote:
> There's a problem with that solution. For some sites, protocol-httpclient
> now generates a SEVERE message: "Narrowly avoided an infinite loop in execute".
> The fetcher then exits, so only the pages fetched before the SEVERE
> message are retrieved.
> I don't know of a solution; for now I have switched back to protocol-http.
Can you provide more details?
Thanks,
Doug
Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
Posted by Massimo Miccoli <mm...@iltrovatore.it>.
There's a problem with that solution. For some sites, protocol-httpclient
now generates a SEVERE message: "Narrowly avoided an infinite loop in execute".
The fetcher then exits, so only the pages fetched before the SEVERE
message are retrieved.
I don't know of a solution; for now I have switched back to protocol-http.
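As a rough illustration of the symptom described above (and not the fix that went into Nutch), the Python sketch below detects a redirect cycle for a single URL and records it as a failure, so the remaining URLs are still fetched. The `fetch` callable, the `RedirectLoop` exception, the `max_hops` value, and the example hosts are all assumptions made for this example.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class RedirectLoop(Exception):
    """Raised when a URL redirects in a cycle or through too many hops."""


def resolve(url, fetch, max_hops=10):
    """Follow redirects from `url`; return (final_url, status, body)."""
    seen = set()
    while url not in seen and len(seen) < max_hops:
        seen.add(url)
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES and "Location" in headers:
            url = urljoin(url, headers["Location"])
        else:
            return url, status, body
    raise RedirectLoop(url)


def fetch_all(urls, fetch):
    """Fetch every URL; a redirect loop fails that URL only, not the run."""
    fetched, failed = [], []
    for u in urls:
        try:
            fetched.append(resolve(u, fetch)[0])
        except RedirectLoop:
            failed.append(u)
    return fetched, failed


# Simulated responses: two pages that redirect to each other, and one that works.
_PAGES = {
    "http://loop.example/a": (302, {"Location": "/b"}, ""),
    "http://loop.example/b": (302, {"Location": "/a"}, ""),
    "http://ok.example/": (200, {}, "hello"),
}


def fake_fetch(url):
    return _PAGES.get(url, (404, {}, ""))


fetched, failed = fetch_all(["http://loop.example/a", "http://ok.example/"], fake_fetch)
```

Here the looping URL ends up in `failed` while the second URL is still fetched, instead of the whole run stopping at the first loop.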
Doug Cutting (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-124?page=all ]
>
> Doug Cutting resolved NUTCH-124:
> --------------------------------
>
> Fix Version: 0.8-dev
> Resolution: Fixed
>
> I have fixed this in the mapred branch.
>
>> protocol-httpclient does not follow redirects when fetching robots.txt
>> ----------------------------------------------------------------------
>>
>> Key: NUTCH-124
>> URL: http://issues.apache.org/jira/browse/NUTCH-124
>> Project: Nutch
>> Type: Bug
>> Components: fetcher
>> Versions: 0.8-dev, 0.7.2-dev
>> Reporter: Doug Cutting
>> Fix For: 0.8-dev
>>
>> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site. See http://www.webmasterworld.com/forum11/3008.htm.
[jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-124?page=all ]
Doug Cutting resolved NUTCH-124:
--------------------------------
Fix Version: 0.8-dev
Resolution: Fixed
I have fixed this in the mapred branch.
> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
> Key: NUTCH-124
> URL: http://issues.apache.org/jira/browse/NUTCH-124
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.8-dev, 0.7.2-dev
> Reporter: Doug Cutting
> Fix For: 0.8-dev
>
> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site. See http://www.webmasterworld.com/forum11/3008.htm.