Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/08 17:48:55 UTC

Why isn't fetcher sending the last fetch time when it does a GET?

I'm watching my server logs as I do a second crawl of the site I
crawled yesterday, and it's getting HTTP response code 200 on every
page.  Since none of those pages have changed, ideally the fetcher
should send the last retrieval time in an If-Modified-Since request
header, and the server would then respond with a 304 (Not Modified)
code, so the fetcher wouldn't have to download and reparse the same
page.  Wouldn't this be a major win in terms of bandwidth consumed?
Certainly GoogleBot does it that way.
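
To make the exchange I have in mind concrete, here is a minimal sketch of
a conditional GET, written in Python rather than Nutch's Java.  The
function names and the timestamps are hypothetical, made up for
illustration; this is not Nutch's fetcher code, only the HTTP behaviour I
would expect from a client and a well-behaved server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_headers(last_fetch):
    """Build request headers for a refetch (hypothetical helper).

    last_fetch is a timezone-aware UTC datetime, or None for a first fetch.
    """
    if last_fetch is None:
        return {}
    # HTTP dates use the RFC 1123 format, always expressed in GMT.
    return {"If-Modified-Since": format_datetime(last_fetch, usegmt=True)}


def server_status(request_headers, last_modified):
    """Status a well-behaved server would return (hypothetical helper).

    304 if the page has not changed since the time the client sent,
    otherwise 200 with the full body.
    """
    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        return 304
    return 200


# Assumed example times: the page last changed a week before the first crawl.
page_changed = datetime(2009, 8, 1, 9, 0, 0, tzinfo=timezone.utc)
first_crawl = datetime(2009, 8, 7, 17, 0, 0, tzinfo=timezone.utc)

headers = conditional_headers(first_crawl)
status = server_status(headers, page_changed)
print(headers, status)
```

With the header present the server can answer 304 with no body; without
it (the behaviour I am seeing now) every request gets a full 200.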

I'm doing the crawl using a slightly modified version of the script on the Wiki
http://wiki.apache.org/nutch/Crawl


-- 
http://www.linkedin.com/in/paultomblin