Posted to user@nutch.apache.org by Nutch developer <nu...@googlemail.com> on 2006/02/03 12:40:33 UTC

Updating with If-Modified-Since header

Hello,

just one question regarding updating the content of a
crawled index.

Usually you set the "db.default.fetch.interval" property
to adjust how long to wait before a page is refetched.
Then you run generate/fetch/updatedb, and all pages
that are older than the specified interval are crawled again.

The downside is that all the HTML pages are downloaded
again, even if nothing has changed.

What about the HTTP headers Last-Modified (sent by the server)
and If-Modified-Since (sent by the client)?
Could Nutch support this? It could reduce traffic and make
the crawling a little smarter.
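
To make the idea concrete, here is a minimal sketch of a conditional GET in plain Java. It is not Nutch's fetcher code: the class name ConditionalFetch and its methods are made up for illustration, and the stored last-fetch time is assumed to be available as epoch milliseconds. The client sends If-Modified-Since with the time of the previous fetch, and a server that supports it answers 304 Not Modified with no body if the page is unchanged.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
public class ConditionalFetch {

    // HTTP-date format used in If-Modified-Since, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Formats the time of the previous fetch as an If-Modified-Since header value.
    public static String ifModifiedSinceValue(long lastFetchMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastFetchMillis));
    }

    // True if the server reported that the page has not changed (304).
    public static boolean contentUnchanged(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    // Sends a GET with If-Modified-Since set and returns the HTTP status code.
    public static int conditionalGet(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // The JDK formats the header value itself from the epoch milliseconds.
        conn.setIfModifiedSince(lastFetchMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption for the demo: the page was last fetched 30 days ago.
        long lastFetch = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
        System.out.println("If-Modified-Since: " + ifModifiedSinceValue(lastFetch));
        if (args.length > 0) {
            int status = conditionalGet(args[0], lastFetch);
            System.out.println(contentUnchanged(status)
                    ? "304: unchanged, keep the stored copy"
                    : "status " + status + ": handle as a normal fetch");
        }
    }
}
```

Run it with a URL as the only argument. Servers that ignore the header simply answer 200 with the full page, so the worst case is the same traffic as today.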

Thanks
Oliver