Posted to user@nutch.apache.org by Mike Smith <mi...@gmail.com> on 2006/09/20 21:13:16 UTC

optimum number of map/reduce tasks?

I've been experimenting with some distributed crawls using nutch 0.8 (SVN trunk
version) recently, on five machines: one master node (namenode) and 4
slaves. If I use these settings in hadoop-site.xml, all the injected urls
get fetched:

 mapred.map.tasks = 17
 mapred.reduce.tasks = 11 or 13 or 17
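
For clarity, here is roughly what the relevant part of my hadoop-site.xml
looks like (only the two task-count properties are shown; everything else in
the file is omitted):

 <configuration>
   <property>
     <name>mapred.map.tasks</name>
     <value>17</value>
   </property>
   <property>
     <name>mapred.reduce.tasks</name>
     <value>11</value>  <!-- also tried 13 and 17 -->
   </property>
 </configuration>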


But if I decrease the number of reducers, as is suggested, to something
close to the number of hosts, like:

 mapred.map.tasks = 17
 mapred.reduce.tasks = 5 or 7

then not all the urls get fetched, and about 20% of the injected urls are
lost without any error in the logs!?

Does anyone know what the optimum number of map and reduce tasks is, and why
decreasing the number of reducers (which basically decreases the number of
fetchers) causes losing injected urls? Here are some more benchmarking
results:



 map  red  threads  depth  started            finished           fetched
 17   11   100      3      18:47:24 PDT 2006  21:45:31 PDT 2006   401913
 17    7   100      3      00:17:38 PDT 2006  02:58:12 PDT 2006   362628
 17    5    30      1      15:33:37 PDT 2006  15:39:50 PDT 2006     1682
 17   11    30      1      15:46:00 PDT 2006  15:52:27 PDT 2006     1913
 17   17   100      1      18:12:26 PDT 2006  18:20:57 PDT 2006     1910