Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2011/09/29 20:50:13 UTC
Finally got hadoop + nutch 1.3 + cygwin cluster working! … now
I finally got a three-machine cluster working with nutch 1.3, hadoop 0.20.0
and cygwin! I have a few questions about configuration.
I am only going to be crawling a few domains and I need this cluster to be
very fast. Right now it is slower using hadoop in distributed mode than
using just the local crawl. I am *guessing* that is due to the network
overhead? It is very, very slow.
What settings in mapred-site.xml and hdfs-site.xml might make my crawl
faster? Seems like the crawldb update takes the longest. I was digging
around in the hadoop documentation and the following seemed like good
settings:
mapred.reduce.tasks = <2 x the number of slave processors>
mapred.map.tasks = <10 x the number of slave processors>
increase mapred.child.java.opts memory (see the sketch below)
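In mapred-site.xml that would look something like this (just a sketch; the
numbers assume two slaves with two cores each, so scale them to your own
hardware):

<configuration>
  <!-- hint only: the actual number of maps follows the input splits -->
  <property>
    <name>mapred.map.tasks</name>
    <value>40</value>
  </property>
  <!-- ~2 x total slave cores -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
  <!-- per-task JVM heap; raise only if tasks actually run out of memory -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>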
Anything else I am missing? What about running another crawl cycle
immediately after the first generate is complete? Would that cause problems
with concurrency and updating files/dbs?
Re: Finally got hadoop + nutch 1.3 + cygwin cluster working! … now
Posted by Markus Jelsma <ma...@openindex.io>.
> I finally got a three-machine cluster working with nutch 1.3, hadoop 0.20.0
> and cygwin! I have a few questions about configuration.
Glad to hear!
>
> I am only going to be crawling a few domains and I need this cluster to be
> very fast. Right now it is slower using hadoop in distributed mode than
> using just the local crawl. I am *guessing* that is due to the network
> overhead? It is very, very slow.
You need to know which component is slow: parse? fetch? update? Keep in
mind that HDFS replicates blocks, which takes significant I/O.
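To give one concrete example: with three machines and the default
dfs.replication of 3, every block is written to all three nodes. If you can
accept less redundancy, lowering it in hdfs-site.xml cuts write I/O (a
sketch; note it only affects newly written files):

<configuration>
  <!-- default is 3; 2 copies instead of 3 halves the extra replica traffic -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>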
>
> What settings in mapred-site.xml and hdfs-site.xml might make my crawl
> faster?
Impossible to tell.
> Seems like the crawldb update takes the longest.
Perhaps you filter and normalize in that step? You might not need to, as
parsing already does it.
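Whether you can switch that off depends on your Nutch version; some 1.x
releases expose crawldb.url.filters and crawldb.url.normalizers for exactly
this (check your nutch-default.xml before relying on them):

<configuration>
  <!-- assumed property names: verify they exist in your nutch-default.xml -->
  <property>
    <name>crawldb.url.filters</name>
    <value>false</value>
  </property>
  <property>
    <name>crawldb.url.normalizers</name>
    <value>false</value>
  </property>
</configuration>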
> I was digging
> around in the hadoop documentation and the following seemed like good
> settings:
Those are good defaults on most systems.
>
> mapred.reduce.tasks = <2 x the number of slave processors>
> mapred.map.tasks = <10 x the number of slave processors>
>
> increase mapred.child.java.opts memory
Only if you run out of memory.
>
> Anything else I am missing? What about running another crawl cycle
> immediately after the first generate is complete? Would that cause problems
> with concurrency and updating files/dbs?
Yes, although there's an option to solve that. Better to generate many
segments in one go.
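The option I have in mind is most likely generate.update.crawldb: when set
to true, the Generator marks generated URLs in the CrawlDb so overlapping
generate/fetch/update cycles don't hand out the same URLs twice, at the cost
of an extra CrawlDb rewrite per generate (a sketch for nutch-site.xml):

<configuration>
  <!-- default is false; enables overlapping crawl cycles -->
  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
  </property>
</configuration>

Generating several segments in one go avoids that overhead entirely; the
generate command has a -maxNumSegments option for it.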