Posted to user@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/10/01 06:31:50 UTC

Re: mapred -numFetchers gone?

Rod Taylor wrote:
> With -numFetchers gone it appears I require a generate/update for each
> fetch which serializes the process.

That's correct.  It would be possible to implement something like the
former behaviour by (as before) setting each page's nextFetch date a week
out when it is added to a fetchlist.  But, in mapreduce, dbupdate and
generate are much faster, both because the crawldb doesn't contain links
(and is thus a lot smaller) and because the crawldb update is distributed,
so the downtime between fetcher cycles is much shorter and this technique
may not be required.  Previously dbupdate took nearly as long as fetches,
so parallelizing these made a big difference.  But now, in my experience,
the dbupdate/generate overhead is more like 10-20%.  With mapreduce,
what percent of the time do you find that you're not fetching?
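
[Illustration] A minimal sketch of the nextFetch technique described above,
written in Python rather than Nutch's Java.  The Page class, the generate
function and the HOLD constant are hypothetical names used only for
illustration; they are not Nutch APIs.  The idea is that generate pushes
each selected page's nextFetch date a week out, so a second generate run
before any dbupdate picks different pages and several fetchlists can be
fetched in parallel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    """Hypothetical stand-in for a crawldb entry (not Nutch's own class)."""
    url: str
    next_fetch: datetime


# "A week out", as described in the message; the exact value is an assumption.
HOLD = timedelta(days=7)


def generate(crawldb, now, top_n):
    """Pick up to top_n pages that are due and defer them by HOLD.

    Deferring next_fetch at generate time hides the page from later
    generate calls, so fetchlists produced before the next dbupdate
    do not overlap.
    """
    fetchlist = []
    for page in crawldb:
        if len(fetchlist) >= top_n:
            break
        if page.next_fetch <= now:
            fetchlist.append(page.url)
            page.next_fetch = now + HOLD
    return fetchlist


now = datetime(2005, 10, 1)
crawldb = [Page(f"http://example.com/{i}", now) for i in range(6)]

first = generate(crawldb, now, 3)   # first three due pages
second = generate(crawldb, now, 3)  # the remaining three, no overlap
third = generate(crawldb, now, 3)   # nothing is due any more
```

In this sketch a page that is selected but never fetched simply becomes
due again after the hold period expires.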

Doug

Re: mapred -numFetchers gone?

Posted by Rod Taylor <rb...@sitesell.com>.

On Fri, 2005-09-30 at 21:31 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > With -numFetchers gone it appears I require a generate/update for each
> > fetch which serializes the process.

> parallelizing these made a big difference.  But now, in my experience, 
> the dbupdate/generate overhead is more like 10-20%.  With mapreduce, 
> what percent of the time do you find that you're not fetching?

At this moment I have an overloaded router causing communication
problems between systems. So I get a ton of socket timeouts, which can
cause the reduce percentage complete to go backward.

I'll get back to you when I have a few hundred million more pages and a
corrected network.

-- 
Rod Taylor <rb...@sitesell.com>