Posted to user@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/09/30 16:46:01 UTC

mapred -numFetchers gone?

I used to use -numFetchers to break a single fetch into multiple
blocks, allowing easy retries as well as some overlap with generate
and update.

For example:

generate -numFetchers 4 (blocks 1 through 4)
fetch block1 & fetch block2   (2 threads)
updatedb block1 block2 & fetch block3   (2 threads)
generate -numFetchers 4 (blocks 5 through 8) & fetch block4  (2 threads)
fetch block5 & fetch block6 (2 threads)
updatedb block3 block4 block5 block6 & fetch block7 (2 threads)
generate -numFetchers 4 & fetch block8 (2 threads)

That is, I would make the generate/update cycle dependent on the
success of 50% of the queued fetchers, which meant the other 50% of
the fetchers were available to retrieve data while the previous group
went through the update/generate phase.
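
In shell terms, one round of that schedule looked roughly like the
sketch below; the command names follow the pre-mapred Nutch CLI, and
the db and segments/blockN paths are only illustrative:

  # Generate four fetchlists up front, then overlap fetching with
  # updatedb on the blocks that have already finished.
  bin/nutch generate db segments -numFetchers 4     # block1..block4
  bin/nutch fetch segments/block1 &
  bin/nutch fetch segments/block2 &
  wait
  bin/nutch updatedb db segments/block1 segments/block2 &
  bin/nutch fetch segments/block3 &
  wait
  # ...and so on, per the schedule above.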

I managed to sustain 30Mb/sec this way (sustained meaning 24/7
downloading) until I hit about 150M pages.

With -numFetchers gone, it appears I require a generate/update for
each fetch, which serializes the process.
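
That is, the only loop left appears to be the strictly sequential one
sketched here (the crawldb and segments paths are illustrative):

  # Nothing fetches while generate and updatedb run; each step waits
  # on the previous one to finish.
  while true; do
    bin/nutch generate crawldb segments
    s=`ls -d segments/* | tail -1`      # the segment just generated
    bin/nutch fetch $s
    bin/nutch updatedb crawldb $s
  done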

-- 
Rod Taylor <rb...@sitesell.com>


Re: mapred -numFetchers gone?

Posted by Rod Taylor <rb...@sitesell.com>.
On Fri, 2005-09-30 at 21:31 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > With -numFetchers gone, it appears I require a generate/update for
> > each fetch, which serializes the process.

> Previously dbupdate took nearly as long as fetches, so parallelizing 
> these made a big difference.  But now, in my experience, the 
> dbupdate/generate overhead is more like 10-20%.  With mapreduce, what 
> percent of the time do you find that you're not fetching?

At the moment I have an overloaded router causing communication
problems between systems, so I am getting a ton of socket timeouts,
which can cause the reduce percent-complete to go backward.

I'll get back to you when I have a few hundred million more pages and a
corrected network.

-- 
Rod Taylor <rb...@sitesell.com>


Re: mapred -numFetchers gone?

Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> With -numFetchers gone, it appears I require a generate/update for
> each fetch, which serializes the process.

That's correct.  It would be possible to implement something like the 
former behaviour by (as before) setting a page's nextFetch date to a 
week out when it is added to a fetchlist.  But in mapreduce, dbupdate 
and generate are much faster, both because the crawldb doesn't contain 
links (and is thus a lot smaller) and because the crawldb update is 
distributed, so the downtime between fetcher cycles is much shorter 
and this technique may not be required.  Previously dbupdate took 
nearly as long as fetches, so parallelizing these made a big 
difference.  But now, in my experience, the dbupdate/generate overhead 
is more like 10-20%.  With mapreduce, what percent of the time do you 
find that you're not fetching?
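
With that nextFetch change in place, repeated generate runs would
produce disjoint fetchlists, so the old overlap could be recovered
from the command line.  A rough sketch, with hypothetical segment
names A and B (real generate runs name segments by timestamp):

  # Each generate would skip pages already on an outstanding
  # fetchlist, since their nextFetch date was pushed a week out.
  bin/nutch generate crawldb segments     # writes fetchlist A
  bin/nutch generate crawldb segments     # writes fetchlist B
  bin/nutch fetch segments/A &
  bin/nutch fetch segments/B &
  wait
  bin/nutch updatedb crawldb segments/A
  bin/nutch updatedb crawldb segments/B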

Doug

Re: mapred -numFetchers gone?

Posted by Michael Ji <fj...@yahoo.com>.
I believe it would not be difficult to change the code in generate to
break a single fetchlist into multiple segments.

Michael Ji,

--- Rod Taylor <rb...@sitesell.com> wrote:

> I used to use -numFetchers to break a single fetch into multiple
> blocks, allowing easy retries as well as some overlap with generate
> and update.
> 
> For example:
> 
> generate -numFetchers 4 (blocks 1 through 4)
> fetch block1 & fetch block2   (2 threads)
> updatedb block1 block2 & fetch block3   (2 threads)
> generate -numFetchers 4 (blocks 5 through 8) & fetch block4  (2 threads)
> fetch block5 & fetch block6 (2 threads)
> updatedb block3 block4 block5 block6 & fetch block7 (2 threads)
> generate -numFetchers 4 & fetch block8 (2 threads)
> 
> That is, I would make the generate/update cycle dependent on the
> success of 50% of the queued fetchers, which meant the other 50% of
> the fetchers were available to retrieve data while the previous group
> went through the update/generate phase.
> 
> I managed to sustain 30Mb/sec this way (sustained meaning 24/7
> downloading) until I hit about 150M pages.
> 
> With -numFetchers gone, it appears I require a generate/update for
> each fetch, which serializes the process.
> 
> -- 
> Rod Taylor <rb...@sitesell.com>



		