Posted to dev@nutch.apache.org by Sebastian Nagel | exorbyte <se...@exorbyte.com> on 2011/03/17 14:28:00 UTC

indexer, continued crawl, and redirects

Hello,

With a continued crawl, a couple of URLs that were definitely fetched during the latest
crawl are missing from the index. The indexer is run on all segments, including those
from previous crawler runs. The aim is to reduce the time of a daily crawler run by
not fetching the content of unmodified pages and by making use of adaptive
(re)fetch scheduling.
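
(For reference, the adaptive scheduling mentioned above is selected via the
db.fetch.schedule.class property. A minimal sketch of setting it programmatically;
normally the property simply goes into nutch-site.xml:)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class AdaptiveScheduleSetup {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      // switch from the default fetch schedule to the adaptive one,
      // so that rarely changing pages are refetched less often
      conf.set("db.fetch.schedule.class",
          "org.apache.nutch.crawl.AdaptiveFetchSchedule");
      System.out.println(conf.get("db.fetch.schedule.class"));
    }
  }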

The problem is caused by the way the crawled web server handles errors:
instead of an immediate 404, a redirect to an error page is sent.

If the following sequence of events occurs, a document may be lost from the index:

  day one
    http://xyz.com/page1.aspx  (fetch_success)

  day two (server problems)
    http://xyz.com/page1.aspx  (fetch_redir_temp)
      > http://xyz.com/error.aspx  (404: fetch_gone)

  day three (server ok)
    http://xyz.com/page1.aspx  (fetch_success)

The primary sorting criterion of CrawlDatum is the score, so if the redirected
page's CrawlDatum by accident gets a higher score than the latest one, this page
may be lost from the index although it was fetched successfully during the last
crawler run.
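
To make this concrete, a small self-contained sketch (scores and times are made up;
assuming the usual CrawlDatum constructors and setters):

  import org.apache.nutch.crawl.CrawlDatum;

  public class ScoreOrderingDemo {
    public static void main(String[] args) {
      // day two: page1.aspx answered with a temporary redirect; by accident
      // this datum carries the higher score
      CrawlDatum redir = new CrawlDatum(CrawlDatum.STATUS_FETCH_REDIR_TEMP, 86400);
      redir.setScore(1.5f);
      redir.setFetchTime(1300000000000L);    // older fetch

      // day three: the page was fetched successfully again
      CrawlDatum success = new CrawlDatum(CrawlDatum.STATUS_FETCH_SUCCESS, 86400);
      success.setScore(1.0f);
      success.setFetchTime(1300200000000L);  // newer fetch

      // the comparison is decided by the scores, not by the fetch times,
      // so the older redirect datum can be preferred over the newer success
      System.out.println(redir.compareTo(success));
    }
  }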

The following patch would solve the problem:

          else if (CrawlDatum.hasFetchStatus(datum)) {
            // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
-            fetchDatum = datum;
+          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
+               // take the latest fetch datum regardless of the sorting of CrawlDatum
+               if (fetchDatum == null || datum.getFetchTime() >= fetchDatum.getFetchTime())
+                       fetchDatum = datum;
+          }
          } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                     CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                     CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {

The latest fetch datum is taken unless it is fetch_notmodified.
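
For clarity, the same selection rule in isolation (a sketch only, not the actual
IndexerMapReduce code; the list stands in for the fetch datums the reducer sees
for one URL):

  import java.util.List;

  import org.apache.nutch.crawl.CrawlDatum;

  public class LatestFetchDatum {

    /** Return the most recently fetched datum, skipping fetch_notmodified. */
    public static CrawlDatum select(List<CrawlDatum> datums) {
      CrawlDatum fetchDatum = null;
      for (CrawlDatum datum : datums) {
        if (!CrawlDatum.hasFetchStatus(datum)) continue;
        // don't index unmodified (empty) pages
        if (datum.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) continue;
        // take the latest fetch datum regardless of the score-based sorting
        if (fetchDatum == null
            || datum.getFetchTime() >= fetchDatum.getFetchTime()) {
          fetchDatum = datum;
        }
      }
      return fetchDatum;
    }
  }

Applied to the scenario above, the day-three fetch_success datum is selected even if
the day-two datum sorts first because of its score.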


Are there any pitfalls? The situation is somewhat complicated.

By the way:

1. Is there a reason why the fetchDatum is checked at all?
    It could be enough to take the latest Content, ParseData, etc.
    if the current dbDatum is db_fetched or db_notmodified (or ...);
    see the sketch below.

2. What about the SegmentMerger? The reduce functions of Indexer and SegmentMerger
    should behave similarly, if not identically. I had a look: the SegmentMerger apparently
    keeps the latest fetchDatum (determined by the segment name/timestamp). It does not
    check for fetch_notmodified. I didn't run a test to verify whether this actually leads
    to lost documents.
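
Regarding question 1, what I mean, roughly (a sketch of the idea only, not tested):

  import org.apache.nutch.crawl.CrawlDatum;

  public class DbStatusCheck {

    /**
     * Decide from the CrawlDb status alone whether the URL should be indexed;
     * the latest Content, ParseData, etc. would then be taken without
     * inspecting the fetch datums at all.
     */
    public static boolean indexable(CrawlDatum dbDatum) {
      return dbDatum != null
          && (dbDatum.getStatus() == CrawlDatum.STATUS_DB_FETCHED
              || dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED);
    }
  }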

Regards and thanks,

Sebastian

P.S.: Of course, I agree that sending a redirect in the case of a temporary server failure is not
best practice. But I cannot change...