Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2019/10/18 14:00:01 UTC

[jira] [Created] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

Sebastian Nagel created NUTCH-2748:
--------------------------------------

             Summary: Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
                 Key: NUTCH-2748
                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, fetcher
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update this "gone" item overwrites the status of existing items (including "db_fetched") with "db_gone". It shouldn't, because the final redirect target has never been fetched and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.
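The problematic overwrite can be illustrated with a minimal sketch. The class and method names below (CrawlDbMergeSketch, merge) are illustrative stand-ins, not the actual Nutch CrawlDbReducer code; only the status names come from Nutch.

```java
// Illustrative sketch of the merge rule behind the bug: an incoming
// fetch_gone datum overwrites an existing db_fetched entry with db_gone,
// even when "gone" only means "redirect chain too long".
public class CrawlDbMergeSketch {
    static final String DB_FETCHED = "db_fetched";
    static final String DB_GONE = "db_gone";
    static final String FETCH_GONE = "fetch_gone";

    /** Simplified merge of an existing CrawlDb entry with a fetch result. */
    static String merge(String existingDbStatus, String fetchStatus) {
        if (FETCH_GONE.equals(fetchStatus)) {
            // Current behavior: "gone" wins, regardless of whether the
            // protocol status was notfound or merely redir_exceeded.
            return DB_GONE;
        }
        return existingDbStatus;
    }

    public static void main(String[] args) {
        // A previously fetched page is marked gone after redir_exceeded:
        System.out.println(merge(DB_FETCHED, FETCH_GONE)); // prints "db_gone"
    }
}
```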

There are two possible solutions:
1. ignore protocol status "redir_exceeded" during the CrawlDb update
2. when http.redirect.max is hit, have the fetcher store nothing or a redirect status instead of a fetch_gone

Solution 2 seems easier to implement, and it would also allow making the behavior configurable:
- store a redirect (fetch_redir_temp or fetch_redir_perm)
- store "fetch_gone" (current behavior)
- store nothing, i.e. ignore these redirects - this should be the default, as it is close to the current behavior without the risk of accidentally setting successful fetches to db_gone
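The three configurable behaviors above could be sketched as follows. This is a hypothetical outline, not a patch: the enum, method, and constant values are invented for illustration and do not correspond to actual Nutch API names or CrawlDatum byte codes.

```java
// Hypothetical sketch of solution 2: when http.redirect.max is exceeded,
// a config-driven action decides what (if anything) the fetcher records.
public class RedirExceededSketch {

    // Simplified stand-ins for CrawlDatum status codes (values invented)
    static final byte STATUS_FETCH_GONE = 3;
    static final byte STATUS_FETCH_REDIR_TEMP = 4;
    static final byte STATUS_FETCH_REDIR_PERM = 5;

    enum RedirExceededAction { IGNORE, STORE_REDIRECT, STORE_GONE }

    /**
     * Decide which status (if any) to emit when the redirect limit is hit.
     * @return the status byte, or null to store nothing (proposed default).
     */
    static Byte resolveStatus(RedirExceededAction action, boolean temporaryRedirect) {
        switch (action) {
            case STORE_REDIRECT:
                return temporaryRedirect ? STATUS_FETCH_REDIR_TEMP
                                         : STATUS_FETCH_REDIR_PERM;
            case STORE_GONE:
                return STATUS_FETCH_GONE; // current (1.16) behavior
            case IGNORE:
            default:
                return null; // store nothing, leave CrawlDb entry untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveStatus(RedirExceededAction.IGNORE, true));
        System.out.println(resolveStatus(RedirExceededAction.STORE_GONE, true));
        System.out.println(resolveStatus(RedirExceededAction.STORE_REDIRECT, true));
    }
}
```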

--
This message was sent by Atlassian Jira
(v8.3.4#803005)