Posted to user@nutch.apache.org by mos <mo...@gmail.com> on 2006/02/02 16:39:01 UTC

Wrong 'Next Fetch' Date

Hello,

Just a few days ago we started to use Nutch (0.7.1).
It's really nice and I would like to see it evolve.

Here's my issue/question:

While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
failed with: java.lang.Exception:
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry
later.
That seems to be ok and indicates some network problems.

The problem is that the entry in the Webdb shows the following:

Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days

The 'Next fetch' date is set to the year 292278994.
I probably won't live to see that refetch. ;)
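
One observation that may help narrow this down: the year 292278994 is exactly
what Java reports for a timestamp of Long.MAX_VALUE milliseconds since the
epoch. That suggests the next fetch time may have been set to Long.MAX_VALUE
somewhere, but this is only a guess; nothing above confirms where the value
comes from. A minimal sketch to check the year (the class and method names are
made up for illustration, and java.time needs a newer JVM than the one Nutch
0.7.1 was written for):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class NextFetchYear {
    // Returns the calendar year (in UTC) of a timestamp given in
    // milliseconds since 1970-01-01T00:00:00Z.
    public static int yearOf(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // Largest timestamp a Java long can hold.
        System.out.println(yearOf(Long.MAX_VALUE)); // prints 292278994
    }
}
```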

What's wrong here? I hope it's not my lifespan. ;)
A page that couldn't be crawled because of network problems
should be refetched with the next crawl (i.e. its next fetch date should be
set to the next day).

I'm just using the standard API of Nutch 0.7.1, like this:

WebDBWriter webdb = new WebDBWriter(fileSystem, new File(dbPath));
UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, true, -1);
tool.updateForSegment(fileSystem, lseg);
tool.close();

Thanks
mos
