Posted to user@nutch.apache.org by Mike Smith <mi...@gmail.com> on 2006/09/20 21:13:16 UTC
optimum number of map/reduce tasks?
I've been experimenting with distributed crawls using nutch 0.8 (SVN trunk
version) recently on five machines: one master node (namenode) and 4
slaves. If I use these settings in hadoop-site.xml, all the injected urls
get fetched:
mapred.map.tasks = 17
mapred.reduce.tasks = 11 or 13 or 17
But if I decrease the number of reducers, as is suggested, to close to the
number of hosts, like:
mapred.map.tasks = 17
mapred.reduce.tasks = 5 or 7
then not all the urls get fetched and about 20% of the injected urls are
lost without any error log!?
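For reference, both runs set these values in hadoop-site.xml using the
standard Hadoop property entries, roughly like this (an illustrative
snippet with the values from the working run, not my exact file):

  <!-- default number of map tasks per job -->
  <property>
    <name>mapred.map.tasks</name>
    <value>17</value>
  </property>
  <!-- default number of reduce tasks per job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>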
Does anyone know what the optimum numbers of map and reduce tasks are, and
why decreasing the number of reducers, which basically decreases the
number of fetchers, causes losing injected urls? Here are some more
benchmarking results:
map=17 red=11
Started at: 18:47:24 PDT 2006
Finished at: 21:45:31 PDT 2006
Threads: 100
Depth: 3
Fetched: 401913
map=17 red=7
Started at: 00:17:38 PDT 2006
Finished at: 02:58:12 PDT 2006
Threads: 100
Depth: 3
Fetched: 362628
map=17 red=5
Started at: 15:33:37 PDT 2006
Finished at: 15:39:50 PDT 2006
Threads: 30
Depth: 1
Fetched: 1682
map=17 red=11
Started at: 15:46:00 PDT 2006
Finished at: 15:52:27 PDT 2006
Threads: 30
Depth: 1
Fetched: 1913
map=17 red=17
Started at: 18:12:26 PDT 2006
Finished at: 18:20:57 PDT 2006
Threads: 100
Depth: 1
Fetched: 1910