Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/12/19 22:09:03 UTC

Re: large number of urls from Generator are not fetched?

For anyone searching this thread in the future: one possible cause of
this is Hadoop nodes that are not time-synchronized with NTP or
something similar.

For example, suppose one or more of the slave nodes is a few minutes
ahead of the others and an inject job happens to run on one of those
nodes (job placement is essentially up to the system, so this won't
happen every time if only some of the nodes are out of sync).  If a
generate job is then run on a node whose clock is behind the out-of-sync
nodes, some of the urls may not get fetched, because their fetch time in
the crawl db is later than the current time on the machine doing the
generate task.
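
To make that failure mode concrete, here is a rough sketch (in the spirit
of, but not copied from, the Nutch Generator source) of the kind of cutoff
check that decides whether a url is due for fetching.  The class name
FetchTimeCheck and the due() helper are made up for illustration; only
CrawlDatum's fetch-time accessors are the real Nutch API.

import org.apache.nutch.crawl.CrawlDatum;

public class FetchTimeCheck {

  // True if the url should be included in the generated fetch list.
  // curTime is the clock of the node running the generate task.
  static boolean due(CrawlDatum datum, long curTime) {
    // If inject ran on a node whose clock was ahead, the fetch time
    // stored in the crawl db can still be "in the future" here, so
    // the url is silently left out of the segment.
    return datum.getFetchTime() <= curTime;
  }

  public static void main(String[] args) {
    CrawlDatum datum = new CrawlDatum();
    // Simulate an inject done on a node whose clock is 5 minutes ahead.
    datum.setFetchTime(System.currentTimeMillis() + 5 * 60 * 1000L);
    // Prints "false": the url would be skipped by generate.
    System.out.println(due(datum, System.currentTimeMillis()));
  }
}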

Being out of sync also seems to affect other things, such as tasks
stalling for a couple of minutes, but I don't have specific information
on that.  The fix is to point the nodes at a time server in your network,
or at a public time server, and in either case make sure the nodes stay
time-synchronized by running ntp on startup.

Dennis

AJ Chen wrote:
> Any idea why nutch (0.9-dev) does not try to fetch every url generated?
> For example, if the Generator generates 200,000 urls, maybe fewer than
> 100,000 urls will be fetched (succeeded or failed). This is a big
> difference, which is obvious from checking the number of urls in the
> log or from running readseg -list. What causes a large number of urls
> to get thrown out by the Fetcher?
>
> Thanks,