Posted to user@nutch.apache.org by Karen Church <ka...@ucd.ie> on 2005/10/27 12:44:45 UTC

Crawler Behaviour

Hi all,

I was wondering if you could help me understand the behaviour of the Nutch crawler a little better.

Let's say I start two separate Nutch crawlers at exactly the same time on the same day, and each carries out a whole-web crawl of the same 125 hosts/seed URLs to a depth of 5, with db.max.outlinks.per.page left at its default value of 100 on both machines. Will the output be the same, or at least similar?
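To be concrete, I'm assuming both machines are launched identically with the one-shot crawl tool; the seed file and output directory names below are just placeholders:

    # whole-web style crawl of the seed list to depth 5
    bin/nutch crawl urls -dir crawl -depth 5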

Aside from the fact that some pages may not be accessible due to HTTP errors, will anything built into Nutch affect the output? For example, will the db.max.outlinks.per.page property affect anything? As far as I'm aware, this property means that at most 100 outlinks will be processed per page, regardless of how many outlinks were originally extracted from the page. As long as these outlinks are processed in the same order, the output will be the same, but are they processed in a random order? And what about the fact that Nutch randomizes its fetchlists: would that cause differences between the two crawls?
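For reference, I haven't overridden the property on either machine, so both should be using the entry from conf/nutch-default.xml, which as far as I know looks something like this (overriding it would go in conf/nutch-site.xml):

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>100</value>
    </property>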

Thanks,

Karen