Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2023/03/17 15:50:00 UTC
[jira] [Created] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
Sebastian Nagel created NUTCH-2990:
--------------------------------------
Summary: HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
Key: NUTCH-2990
URL: https://issues.apache.org/jira/browse/NUTCH-2990
Project: Nutch
Issue Type: Improvement
Components: protocol, robots
Affects Versions: 1.19
Reporter: Sebastian Nagel
Fix For: 1.20
The robots.txt parser ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html]) follows only one redirect when fetching the robots.txt, while RFC 9309 (the robots.txt RFC) recommends following at least five consecutive redirects:
{quote} 2.3.1.2. Redirects
It's possible that a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).
If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority. If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt file is unavailable.
(https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
While following redirects, the parser should check whether the redirect target is itself a "/robots.txt" on a different host and, if so, first try to read that host's robots.txt rules from the cache before fetching it.
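
The proposed behavior could look roughly like the sketch below. It is not the Nutch implementation: the class RobotsRedirectSketch, the Response type, the fetcher function, and the cache keyed by scheme plus authority are all hypothetical stand-ins chosen to keep the example self-contained. Only the limit of five redirects and the cache lookup for a cross-host "/robots.txt" come from the description above.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are part of the Nutch API. */
final class RobotsRedirectSketch {

  /** Redirect limit recommended by RFC 9309, section 2.3.1.2. */
  static final int MAX_REDIRECTS = 5;

  /** Minimal stand-in for an HTTP response (assumed type, not a Nutch class). */
  static final class Response {
    final int status;
    final String location; // Location header, null unless the response is a redirect
    final String body;     // robots.txt content, null unless status is 200

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  static boolean isRedirect(int status) {
    return status == 301 || status == 302 || status == 303
        || status == 307 || status == 308;
  }

  /** Assumed cache key: scheme plus authority, e.g. "https://www.example.org". */
  static String cacheKey(URI url) {
    return url.getScheme() + "://" + url.getAuthority();
  }

  /**
   * Returns the robots.txt content reached from {@code start}, or empty if
   * none was obtained within MAX_REDIRECTS redirects. The caller decides how
   * to treat the empty case.
   */
  static Optional<String> fetchRobots(URI start,
      Function<URI, Response> fetcher, Map<String, String> cache) {
    URI current = start;
    // One initial request plus at most MAX_REDIRECTS redirected requests.
    for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
      Response response = fetcher.apply(current);
      if (response.status == 200) {
        return Optional.ofNullable(response.body);
      }
      if (!isRedirect(response.status) || response.location == null) {
        return Optional.empty();
      }
      if (hops == MAX_REDIRECTS) {
        return Optional.empty(); // this would be a sixth redirect: give up
      }
      URI target = current.resolve(response.location);
      boolean otherHost = target.getHost() != null
          && !target.getHost().equalsIgnoreCase(start.getHost());
      // A redirect to another host's /robots.txt may already be cached.
      if (otherHost && "/robots.txt".equals(target.getPath())) {
        String cached = cache.get(cacheKey(target));
        if (cached != null) {
          return Optional.of(cached);
        }
      }
      current = target;
    }
    return Optional.empty();
  }
}
```

Whatever content is returned would still be applied in the context of the initial authority, as the RFC requires. The sketch returns raw content only and leaves parsing and the handling of 4xx and 5xx responses to the caller.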
--
This message was sent by Atlassian Jira
(v8.20.10#820010)