Posted to dev@nutch.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/03/20 19:16:00 UTC

[jira] [Commented] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects

    [ https://issues.apache.org/jira/browse/NUTCH-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063585#comment-17063585 ] 

ASF GitHub Bot commented on NUTCH-2776:
---------------------------------------

sebastian-nagel commented on pull request #505: NUTCH-2776 Fetcher to temporarily deduplicate followed redirects
URL: https://github.com/apache/nutch/pull/505
 
 
   - cache followed redirect targets for a configurable time (`fetcher.redirect.dedupcache.seconds`)
   - if a redirect target is found in the cache, it is skipped
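
The two points above amount to a time-limited "seen" set keyed by redirect target. A minimal sketch of that idea follows; it is an illustration only, not the code in the pull request. The class name `RedirectDedupCache`, the method `seenRecently`, and the injectable clock are hypothetical, and the TTL is assumed to come from `fetcher.redirect.dedupcache.seconds`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a time-limited redirect deduplication cache.
 * Not the Nutch implementation; names and structure are assumptions.
 */
final class RedirectDedupCache {
  // Maps a redirect target URL to the time (in ms) at which its entry expires.
  private final ConcurrentHashMap<String, Long> expiryByUrl = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clockMillis;

  /** ttlSeconds is assumed to be read from fetcher.redirect.dedupcache.seconds. */
  RedirectDedupCache(long ttlSeconds, LongSupplier clockMillis) {
    this.ttlMillis = ttlSeconds * 1000L;
    this.clockMillis = clockMillis;
  }

  /**
   * Returns true if the target was recorded within the TTL and should be
   * skipped. Otherwise records the target and returns false.
   */
  boolean seenRecently(String redirectTarget) {
    long now = clockMillis.getAsLong();
    boolean[] duplicate = {false};
    // compute() is atomic per key, so concurrent fetcher threads agree on
    // which of them saw the target first.
    expiryByUrl.compute(redirectTarget, (url, expiry) -> {
      if (expiry != null && expiry > now) {
        duplicate[0] = true;
        return expiry; // keep the original expiry; duplicates do not extend it
      }
      return now + ttlMillis; // first sighting, or the old entry has expired
    });
    return duplicate[0];
  }

  /** Drops expired entries; this sketch never evicts them on its own. */
  void purgeExpired() {
    long now = clockMillis.getAsLong();
    expiryByUrl.values().removeIf(expiry -> expiry <= now);
  }
}
```

A fetcher thread would call `seenRecently(target)` before queuing a followed redirect and skip the target when it returns true. In this sketch a TTL of 0 never reports a duplicate, and passing `System::currentTimeMillis` as the clock gives wall-clock expiry.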
   
 


> Fetcher to temporarily deduplicate followed redirects
> -----------------------------------------------------
>
>                 Key: NUTCH-2776
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2776
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> If the fetcher follows redirects (http.redirect.max > 0), many redirects of a site may point to the same URL. In this situation, it would help if the fetcher could temporarily (for a configurable time period) deduplicate the redirect targets and skip all redirects except the first one. Typical examples of duplicated redirect targets are:
> - a page served via redirect instead of responding with HTTP status 404:
> {noformat}
> /
> /resource-not-found
> /search/
> /404
> /error/not-found
> /err/notfound.html
> {noformat}
> - a page to accept/decline cookies
> {noformat}
> /cookie_usage.php
> {noformat}


