Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/09/23 21:51:23 UTC
[jira] Closed: (NUTCH-205) Wrong 'fetch date' for non available pages

     [ http://issues.apache.org/jira/browse/NUTCH-205?page=all ]

Andrzej Bialecki  closed NUTCH-205.
-----------------------------------

    Fix Version/s: 0.8.1
                   0.9.0
       Resolution: Fixed

This issue has been fixed as a part of NUTCH-350.

> Wrong 'fetch date' for non available pages
> ------------------------------------------
>
>                 Key: NUTCH-205
>                 URL: http://issues.apache.org/jira/browse/NUTCH-205
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.7.1, 0.7
>         Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
>            Reporter: M.Oliver Scheele
>             Fix For: 0.8.1, 0.9.0
>
>
> Web pages that could not be fetched because of a time-out are never refetched.
> Their next fetch time in the web-db is set to Long.MAX_VALUE.
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
> 60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the Webdb shows the following:
> Page 4: Version: 4
> URL: http://www.test-domain.de/crawl_html/page_2.html
> ID: b360ec931855b0420776909bd96557c0
> Next fetch: Sun Aug 17 07:12:55 CET 292278994
> Retries since fetch: 0
> Retry interval: 0 days
> The 'Next fetch' date is set to the year '292278994'.
> I probably won't live to see that refetch. ;)
> A page that couldn't be crawled because of network problems
> should be refetched with the next crawl (i.e. set the next fetch date to current time + 1h).
> Possible bug fix:
> ----------------------------
> When updating the web-db, the method updateForSegment() in UpdateDatabaseTool
> always sets the fetch date to Long.MAX_VALUE for any (unknown) exception during fetching.
> The RETRY status is not always set correctly.
> Change the following lines:
> } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>            page.getRetriesSinceFetch() < MAX_RETRIES) {
>   pageRetry(fo);                      // retry later
> } else {
>   pageGone(fo);                       // give up: page is gone
> }
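
For readers following the report, here is a minimal, self-contained sketch of the scheduling behaviour the reporter asks for. It is not Nutch code and not the fix applied in NUTCH-350. The class RefetchSketch, the Outcome enum, the nextFetchTime() and yearOf() helpers, the value 3 for MAX_RETRIES and the one-hour RETRY_DELAY_MS are all assumptions made for illustration. The only ideas taken from the report are that a transient failure should be rescheduled about one hour ahead, and that a next-fetch time of Long.MAX_VALUE displays as the year 292278994.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustrative sketch only: none of these names are part of the Nutch API.
class RefetchSketch {
    // Hypothetical retry limit; the real MAX_RETRIES value is not given in the report.
    static final int MAX_RETRIES = 3;
    // One hour, as the reporter suggests for transient network failures.
    static final long RETRY_DELAY_MS = 60L * 60L * 1000L;

    // Hypothetical stand-in for the fetcher's protocol status codes.
    enum Outcome { SUCCESS, RETRY, GONE }

    // Returns the next fetch time in epoch milliseconds.
    static long nextFetchTime(Outcome outcome, int retriesSinceFetch,
                              long now, long fetchIntervalMs) {
        if (outcome == Outcome.SUCCESS) {
            return now + fetchIntervalMs;      // normal refetch cycle
        }
        if (outcome == Outcome.RETRY && retriesSinceFetch < MAX_RETRIES) {
            return now + RETRY_DELAY_MS;       // transient failure: retry soon
        }
        return Long.MAX_VALUE;                 // give up: never refetch
    }

    // Year (UTC, Gregorian calendar) of the given epoch-millisecond timestamp.
    static int yearOf(long epochMillis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long retryAt = nextFetchTime(Outcome.RETRY, 0, now, 0L);
        System.out.println("retry in ms: " + (retryAt - now));
        // A page marked as gone gets Long.MAX_VALUE, the far-future year
        // shown in the web-db dump quoted above.
        System.out.println("year of Long.MAX_VALUE: " + yearOf(Long.MAX_VALUE));
    }
}
```

In this sketch a time-out only ends in Long.MAX_VALUE once the retry budget is used up, which is the behaviour the report expects from the RETRY branch.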

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira