Posted to dev@nutch.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/01/15 13:17:00 UTC

[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

    [ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476606#comment-17476606 ] 

ASF GitHub Bot commented on NUTCH-2573:
---------------------------------------

sebastian-nagel opened a new pull request #724:
URL: https://github.com/apache/nutch/pull/724


   - add properties (see the configuration sketch after this list):
     - `http.robots.503.defer.visits` :
       enable/disable the feature (default: enabled)
     - `http.robots.503.defer.visits.delay` :
       delay to wait before the next attempt to fetch the robots.txt
       (default: wait 5 minutes)
     - `http.robots.503.defer.visits.retries` :
       max. number of retries before giving up and dropping all URLs from the given host / queue
       (default: give up after the 3rd retry, i.e. after 4 attempts)
   - handle HTTP 5xx in robots.txt parser
   - handle delay, retries and dropping queues in Fetcher
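
   For illustration, a minimal sketch of how the new properties could be set programmatically via the Nutch/Hadoop configuration API. The property names are those listed above; the millisecond unit and the 300000 ms value for the delay are assumptions derived from the 5-minute default, not confirmed by this message.

       import org.apache.hadoop.conf.Configuration;
       import org.apache.nutch.util.NutchConfiguration;

       public class DeferVisitsConfigSketch {
         public static void main(String[] args) {
           Configuration conf = NutchConfiguration.create();
           // enable/disable the feature (enabled by default)
           conf.setBoolean("http.robots.503.defer.visits", true);
           // delay before the next attempt to fetch the robots.txt
           // (assumed unit: milliseconds; 300000 ms = 5 minutes)
           conf.setLong("http.robots.503.defer.visits.delay", 300000L);
           // give up after the 3rd retry, i.e. after 4 attempts
           conf.setInt("http.robots.503.defer.visits.retries", 3);
         }
       }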
   
   Stop queuing fetch items once the timelimit is reached. This applies to
   - re-queued items for which the robots.txt request returned a 5xx,
   - redirects (http.redirect.max > 0) and
   - outlinks (fetcher.follow.outlinks.depth > 0).
   
   In the first version, I forgot to verify whether the Fetcher timelimit (`fetcher.timelimit.mins`) had already been reached before re-queuing the fetch item. This caused a very small number of fetcher tasks to end up in an infinite loop. In detail, this happened:
   1. A fetcher thread starts fetching an item and requests the corresponding robots.txt. Possibly, the server responds slowly.
   2. The fetcher timelimit is reached and all fetcher queues are flushed.
   3. The robots.txt response arrives. Because it is a 5xx, the fetch item is re-queued and the fetch is delayed for 30 minutes (custom configuration).
   
   Steps 1 and 3 are then repeated until the max. number of retries is reached. This is fixed now, and I've also made sure that redirects and outlinks are not queued if the timelimit is reached.
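   
   A minimal sketch of the added guard, under the assumption that the timelimit is kept as an absolute end time in milliseconds (the class and method names are hypothetical, not the exact patch):
   
       public class TimelimitGuardSketch {
         /** True if a fetch item may still be re-queued after a robots.txt
          *  5xx response, false once the Fetcher timelimit has passed. */
         static boolean mayRequeue(long timelimit) {
           // timelimit: absolute end time in ms derived from
           // fetcher.timelimit.mins; -1 means no timelimit configured
           return timelimit == -1 || System.currentTimeMillis() < timelimit;
         }
   
         public static void main(String[] args) {
           long timelimit = System.currentTimeMillis() + 60_000L; // one minute
           System.out.println("may re-queue: " + mayRequeue(timelimit));
         }
       }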
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@nutch.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
>                 Key: NUTCH-2573
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2573
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--] to store this information (must be set from RobotRulesParser).
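>
> For illustration, a minimal sketch of how a 5xx robots.txt response could be mapped to crawler-commons rules with the defer-visits flag set (the surrounding class and method are hypothetical, not the actual Nutch patch; ALLOW_NONE mirrors the temporary "full disallow" described above):
>
>     import crawlercommons.robots.BaseRobotRules;
>     import crawlercommons.robots.SimpleRobotRules;
>     import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
>
>     public class DeferVisitsParserSketch {
>       // Hypothetical helper: derive robots rules from the HTTP status
>       // of the robots.txt request.
>       static BaseRobotRules rulesForStatus(int code) {
>         if (code >= 500 && code < 600) {
>           // 5xx: temporary "full disallow", and defer further visits
>           SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
>           rules.setDeferVisits(true);
>           return rules;
>         }
>         // non-5xx handling is out of scope for this sketch
>         return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
>       }
>     }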



--
This message was sent by Atlassian Jira
(v8.20.1#820001)