Posted to user@nutch.apache.org by Joshua J Pavel <jp...@us.ibm.com> on 2011/01/26 19:28:06 UTC

Webserver configuration to successfully get modified time?


We've been crawling with Nutch and deleting the crawldb between crawls. I
believe I've finally managed to get my recrawl script to work, but I was
disappointed to see that in my db, the modified time of every page is
Jan 1 1970. Since I control both the crawler and the web server in our
setup, is there a setting we can change so that Nutch successfully gets
the modified time for the pages? I want to reduce the number of fetches
as much as possible.
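
For context, Jan 1 1970 is what a zero Unix timestamp looks like, so one
thing worth ruling out on the web server side is whether the responses
carry a usable Last-Modified header at all. The sketch below is only a
diagnostic under that assumption, not anything Nutch-specific: the helper
name last_modified is hypothetical, and the header dictionary would come
from whatever HTTP client is used to probe the server.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as an aware datetime, or None.

    Hypothetical helper for checking a server response by hand; it does
    not reflect how Nutch itself reads the header.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("last-modified")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Malformed date string: treat it the same as a missing header.
        return None
    # A "-0000" zone yields a naive datetime; interpret it as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# A zero timestamp is exactly the Jan 1 1970 value seen in the db.
EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)
```

If last_modified returns None for the pages in question, the server is
probably not sending the header (or is sending it in a form that does not
parse), which would be a reasonable first thing to fix before looking at
the crawler configuration.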

Thanks!