Posted to user@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/10/01 06:31:50 UTC
Re: mapred -numFetchers gone?
Rod Taylor wrote:
> With -numFetchers gone it appears I require a generate/update for each
> fetch which serializes the process.
That's correct. It would be possible to implement something like the
former behaviour by (as before) setting each page's nextFetch date a week
out when it is added to a fetchlist. But in mapreduce, dbupdate and
generate are much faster, both because the crawldb doesn't have links (and
is thus a lot smaller) and because the crawldb update is distributed, so the
downtime between fetcher cycles is much shorter and this technique may not
be required. Previously dbupdate took nearly as long as fetches, so
parallelizing these made a big difference. But now, in my experience,
the dbupdate/generate overhead is more like 10-20%. With mapreduce,
what percent of the time do you find that you're not fetching?
Doug
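The technique described above can be sketched roughly as follows. This is a
minimal illustration in Python, not Nutch code: the dictionary standing in for
the crawldb and the names generate, hold and overhead_fraction are all
hypothetical. It assumes that pushing a page's next-fetch time forward at
generate time is enough to keep a second generate from selecting the same
page before the update has run, and it shows how the "percent of time not
fetching" figure could be computed from measured timings.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the crawldb: URL -> next fetch time.
# This is NOT Nutch's data structure or API, only an illustration.
def generate(crawldb, now, top_n, hold=timedelta(days=7)):
    """Pick up to top_n due URLs and push their next-fetch time out by `hold`."""
    due = sorted(url for url, next_fetch in crawldb.items() if next_fetch <= now)
    fetchlist = due[:top_n]
    for url in fetchlist:
        # Mark the page as taken so a later generate skips it,
        # even if no update has run in between.
        crawldb[url] = now + hold
    return fetchlist

def overhead_fraction(fetch_seconds, other_seconds):
    """Share of one cycle spent not fetching (generate plus dbupdate)."""
    return other_seconds / (fetch_seconds + other_seconds)

now = datetime(2005, 10, 1)
crawldb = {f"http://example.com/{i}": now for i in range(6)}

# Three generates in a row, with no update between them.
first = generate(crawldb, now, top_n=3)
second = generate(crawldb, now, top_n=3)
third = generate(crawldb, now, top_n=3)
```

Here the first two fetchlists do not overlap and the third is empty, so
several fetchers could in principle work from separate lists at once. With
made-up timings of 85 minutes fetching and 15 minutes in generate/dbupdate,
overhead_fraction(85, 15) gives 0.15, i.e. within the 10-20% range mentioned
above.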
Re: mapred -numFetchers gone?
Posted by Rod Taylor <rb...@sitesell.com>.
On Fri, 2005-09-30 at 21:31 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > With -numFetchers gone it appears I require a generate/update for each
> > fetch which serializes the process.
> parallelizing these made a big difference. But now, in my experience,
> the dbupdate/generate overhead is more like 10-20%. With mapreduce,
> what percent of the time do you find that you're not fetching?
At the moment I have an overloaded router causing communication
problems between systems, so I get a ton of socket timeouts, which can
cause the reduce percentage complete to go backward.
I'll get back to you when I have a few hundred million more pages and a
corrected network.
--
Rod Taylor <rb...@sitesell.com>