Posted to user@nutch.apache.org by "nutch.buddy@gmail.com" <nu...@gmail.com> on 2012/03/11 19:53:11 UTC

crawling with -1 as fetch.interval causes all pages to be refetched at same running instance

Hi,

I'm running nutch 1.2 with fetch.interval.default = -1.
I want all pages to be refetched each run.

Also, I'm running nutch so it will process a single url per
generate-fetch-parse-update cycle.

What actually happens seems obvious in hindsight: in each cycle nutch fetches
not only the url it was given but also all the previous urls. I guess this is
because the crawldb already contains them on the one hand, and there's no fetch
interval to prevent refetching on the other.

So instead of having all pages refetched once per run, I ended up having all
pages refetched many times in each run.
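
To illustrate what I think is going on, here is a toy Python model. It is not
Nutch code: the names (run_cycles, crawldb) and the rule "a url is selected
when its next fetch time is <= now" are my own simplification of the
generate step, and each cycle is assumed to take one unit of time.

```python
# Toy model only, NOT Nutch's implementation.
# Assumption: generate selects every url whose next fetch time is <= now,
# and update sets next fetch time = now + interval.

def run_cycles(new_urls, interval):
    """Feed one new url per cycle and return the urls fetched in each cycle."""
    crawldb = {}            # hypothetical structure: url -> next fetch time
    fetched_per_cycle = []
    for now, url in enumerate(new_urls):
        # inject: a new url is due immediately
        crawldb.setdefault(url, now)
        # generate: pick everything that is due
        due = sorted(u for u, t in crawldb.items() if t <= now)
        # fetch + update: reschedule each fetched url
        for u in due:
            crawldb[u] = now + interval
        fetched_per_cycle.append(due)
    return fetched_per_cycle


if __name__ == "__main__":
    # Negative interval: old urls are always due, so each cycle grows.
    print(run_cycles(["a", "b", "c"], -1))
    # Positive interval longer than the run: only the new url is fetched.
    print(run_cycles(["a", "b", "c"], 30))
```

In this model a negative interval puts the next fetch time in the past, so
every url already in the crawldb is picked up again in every later cycle.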

Is there any way to prevent this behaviour?


