Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/12/19 22:09:03 UTC
Re: large number of urls from Generator are not fetched?
For anyone searching this thread in the future: one possible cause of
this is that the Hadoop nodes are not time-synchronized via NTP or
something similar.
For example, suppose one or more of the slave nodes is a few minutes
ahead of the others and an inject job runs on one of those nodes. (Job
placement is essentially random and up to the system, so this wouldn't
happen every time if only some of the nodes are out of sync.) If a
generate job then runs on a node whose clock is behind the out-of-sync
nodes (again random), some of the urls may not get fetched, because
their starting fetch time in the crawl db is later than the current
time on the machine doing the generate task.
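The effect described above can be sketched as a simple comparison; this is an illustrative toy, not the actual Nutch Generator source, and the class and method names here are made up for the example:

```java
// Sketch of how a generate-time eligibility check skips urls whose
// stored fetch time (written by a fast-clocked inject node) is still
// in the future according to the generate node's slower clock.
public class FetchTimeCheck {

    // A url is eligible for the fetch list only if its scheduled
    // fetch time is not later than the generate node's current time.
    static boolean eligible(long fetchTimeMs, long nowMs) {
        return fetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long now  = System.currentTimeMillis(); // clock on the generate node
        long skew = 3 * 60 * 1000L;             // inject node 3 minutes ahead

        // Url injected on the fast node: fetch time is "in the future",
        // so the generator silently drops it from this fetch list.
        System.out.println(eligible(now + skew, now)); // prints false

        // Url injected on a synchronized node: generated normally.
        System.out.println(eligible(now, now));        // prints true
    }
}
```

With thousands of urls injected on a fast-clocked node, this one comparison is enough to explain a large shortfall between generated and fetched counts.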
Being out of sync also seems to affect other things, such as tasks
stalling for a couple of minutes, but I don't have specific information
on that. The fix is to set up the nodes to use a time server on your
network, or a public time server, and in either case keep the nodes
time-synchronized by running ntp on startup.
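On a typical Linux node of that era, the setup might look something like the following; the exact commands and service names vary by distribution, and pool.ntp.org stands in for whatever time server you choose:

```shell
# One-time step: force the clock into sync with a time server.
ntpdate pool.ntp.org

# Make ntpd start on boot so the node stays synchronized
# (chkconfig on Red Hat-style systems; use your distro's equivalent).
chkconfig ntpd on
service ntpd start
```

To use an internal time server instead, point the `server` lines in /etc/ntp.conf at it before starting ntpd.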
Dennis
AJ Chen wrote:
> Any idea why nutch (0.9-dev) does not try to fetch every url
> generated? For example, if Generator generates 200,000 urls, maybe
> <100,000 urls will be fetched, succeeded or failed. This is a big
> difference, which is obvious by checking the number of urls in the log
> or by running readseg -list. What causes a large number of urls to get
> thrown out by the Fetcher?
>
> Thanks,