Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/06/28 09:18:00 UTC
[jira] [Updated] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2573:
-----------------------------------
Fix Version/s: (was: 1.15)
1.16
> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined.??
> Crawler-commons robots rules already provide [isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--] to store this information (it must be set from RobotRulesParser).
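
The proposed behaviour could be sketched roughly as follows. This is only an illustration under stated assumptions, not Nutch or crawler-commons code: the class RobotsDeferSketch, its method names, and the per-host map are hypothetical. In Nutch itself the decision would presumably be carried by the isDeferVisits flag on the crawler-commons robots rules mentioned above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: suspend a host after a 5xx response for robots.txt. */
class RobotsDeferSketch {

    // host -> instant until which fetching from that host is suspended
    private final Map<String, Instant> suspendedUntil = new HashMap<>();

    // Assumed to come from configuration; the issue does not name a property.
    private final Duration suspendInterval;

    RobotsDeferSketch(Duration suspendInterval) {
        this.suspendInterval = suspendInterval;
    }

    /** True when the robots.txt response is a server error (HTTP 5xx). */
    static boolean isServerError(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    /** Record the outcome of a robots.txt fetch for the given host. */
    void onRobotsTxtFetched(String host, int httpStatus, Instant now) {
        if (isServerError(httpStatus)) {
            // Treat as a temporary "full disallow": defer visits for a while.
            suspendedUntil.put(host, now.plus(suspendInterval));
        } else {
            // Any non-5xx result ends the suspension.
            suspendedUntil.remove(host);
        }
    }

    /** True while the host is still inside its suspension interval. */
    boolean isSuspended(String host, Instant now) {
        Instant until = suspendedUntil.get(host);
        return until != null && now.isBefore(until);
    }
}
```

A fetcher following this sketch would call onRobotsTxtFetched after each robots.txt request and skip or re-queue URLs of a host while isSuspended returns true. How retries are scheduled once the interval has elapsed is left open here, as it is in the issue description.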
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)