Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/03/20 18:53:00 UTC
[jira] [Created] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects
Sebastian Nagel created NUTCH-2776:
--------------------------------------
Summary: Fetcher to temporarily deduplicate followed redirects
Key: NUTCH-2776
URL: https://issues.apache.org/jira/browse/NUTCH-2776
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.16
Reporter: Sebastian Nagel
Fix For: 1.17
If the fetcher follows redirects (http.redirect.max > 0), many redirects of a site may point to the same URL. In this situation, it would be useful if the fetcher could temporarily (for a configurable time period) deduplicate the redirect targets and skip all redirects except the first one. Typical examples of duplicated redirect targets are:
- an error page served via redirect instead of responding with HTTP status 404:
{noformat}
/
/resource-not-found
/search/
/404
/error/not-found
/err/notfound.html
{noformat}
- a page to accept/decline cookies:
{noformat}
/cookie_usage.php
{noformat}
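The time-limited deduplication described above could be sketched roughly as follows. This is only an illustration, not the Nutch implementation: the class RedirectDeduplicator, its methods, and the ttlMillis parameter are hypothetical names, and the issue does not say which configuration property would hold the time period.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Nutch): remembers redirect targets for a
 * limited time so that only the first redirect to a given URL is followed.
 */
class RedirectDeduplicator {

  /** Redirect target URL -> time (epoch millis) at which the entry expires. */
  private final ConcurrentHashMap<String, Long> expiry = new ConcurrentHashMap<>();

  /** Assumed to come from a configurable time period. */
  private final long ttlMillis;

  RedirectDeduplicator(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  /**
   * Returns true if the redirect should be followed, i.e. the target has not
   * been recorded yet or its entry has expired. In that case the target is
   * (re)recorded with a new expiry time. Duplicates within the time period
   * return false and do not extend the expiry time.
   */
  boolean shouldFollow(String targetUrl, long nowMillis) {
    boolean[] follow = {false};
    // compute() is atomic per key, so concurrent fetcher threads cannot
    // both decide to follow the same target.
    expiry.compute(targetUrl, (url, expiresAt) -> {
      if (expiresAt == null || expiresAt <= nowMillis) {
        follow[0] = true;
        return nowMillis + ttlMillis;
      }
      return expiresAt;
    });
    return follow[0];
  }

  /** Removes expired entries so the map does not grow without bound. */
  void evictExpired(long nowMillis) {
    expiry.values().removeIf(expiresAt -> expiresAt <= nowMillis);
  }

  /** Number of targets currently remembered (including not yet evicted ones). */
  int size() {
    return expiry.size();
  }
}
```

In this sketch a fetcher thread would call shouldFollow(target, System.currentTimeMillis()) before queuing a redirect target and skip the redirect when it returns false. The current time is passed in explicitly only to keep the example easy to test.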