Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2023/03/17 15:50:00 UTC

[jira] [Created] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

Sebastian Nagel created NUTCH-2990:
--------------------------------------

             Summary: HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
                 Key: NUTCH-2990
                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
             Project: Nutch
          Issue Type: Improvement
          Components: protocol, robots
    Affects Versions: 1.19
            Reporter: Sebastian Nagel
             Fix For: 1.20


The robots.txt parser ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html]) follows only a single redirect when fetching a robots.txt file, while the robots.txt RFC 9309 recommends following at least five consecutive redirects:

{quote} 2.3.1.2. Redirects

It's possible that a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).
If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority. If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt file is unavailable.
(https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
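
A minimal sketch of such a redirect-following loop, just to illustrate the intended behavior. {{fetch(url)}} and {{Response}} are hypothetical placeholders here, not the existing HttpRobotRulesParser API:

{code:java}
// Sketch only: fetch(url) and Response stand in for the protocol
// plugin's actual fetch call and response type.
private static final int MAX_ROBOTS_REDIRECTS = 5; // RFC 9309, section 2.3.1.2

Response fetchRobotsTxt(URL robotsUrl) throws IOException {
  URL url = robotsUrl;
  for (int redirects = 0; redirects <= MAX_ROBOTS_REDIRECTS; redirects++) {
    Response response = fetch(url); // hypothetical fetch of the current URL
    int code = response.getCode();
    if (code >= 300 && code < 400) {
      String location = response.getHeader("Location");
      if (location == null) {
        return response; // malformed redirect: treat as final response
      }
      // redirects may cross authorities (hosts), as allowed by RFC 9309
      url = new URL(url, location);
      continue;
    }
    return response; // 2xx, 4xx, 5xx: stop following redirects
  }
  // more than five consecutive redirects: the robots.txt MAY be
  // considered unavailable (RFC 9309)
  return null;
}
{code}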

While following redirects, the parser should check whether the redirect target is itself a "/robots.txt" on a different host and, if so, try to read the corresponding rules from the cache.
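
A rough sketch of that cache check. {{CACHE}} and {{getCacheKey(...)}} are illustrative names for the robots rules cache and its key, not necessarily the final implementation; {{BaseRobotRules}} is the crawler-commons rules class:

{code:java}
// Inside the redirect handling: if the redirect target is the
// "/robots.txt" of a different host, its rules may already be cached.
// CACHE and getCacheKey(...) are illustrative names only.
URL target = new URL(url, location);
if ("/robots.txt".equals(target.getPath())
    && !target.getHost().equalsIgnoreCase(url.getHost())) {
  String cacheKey = getCacheKey(target); // e.g. "http:example.org:80"
  BaseRobotRules cached = CACHE.get(cacheKey);
  if (cached != null) {
    // reuse the already parsed rules in the context of the initial authority
    return cached;
  }
}
{code}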



--
This message was sent by Atlassian Jira
(v8.20.10#820010)