Posted to user@nutch.apache.org by Sergei Surovtsev <cy...@gmail.com> on 2010/08/14 00:52:24 UTC
Fwd: Crawl performance problem on 5 xeon machines
Hello.
I have a problem with a cluster which consists of 5 2U Xeon 5600 16GB RAM
machines. I'm trying to perform a relatively big crawl on it: 500M pages.
Bandwidth is 1Gbit.
I can't utilize even 10% of the channel and can't get over 64 threads per
machine without a significant number of exceptions. The best performance I
have seen is ~250 pages per second across the whole cluster.
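As a rough sanity check on these figures, the sketch below estimates how long the 500M-page crawl would take at 250 pages per second and how much of the 1Gbit link that rate would occupy. The average page size is an assumption made purely for illustration (it is not stated in this post), and the function names are hypothetical.

```python
# Figures taken from the post.
TARGET_PAGES = 500_000_000       # planned crawl size
PAGES_PER_SECOND = 250           # best observed cluster-wide fetch rate
LINK_BITS_PER_SECOND = 1e9       # 1 Gbit/s uplink

# ASSUMPTION (not from the post): average size of a fetched page, in bytes.
ASSUMED_PAGE_BYTES = 20 * 1024


def crawl_days(pages: int, rate: float) -> float:
    """Days needed to fetch `pages` at a constant `rate` in pages/second."""
    return pages / rate / 86_400


def link_utilization(rate: float, page_bytes: int, link_bps: float) -> float:
    """Fraction of the link used by page payload at the given fetch rate."""
    return rate * page_bytes * 8 / link_bps


if __name__ == "__main__":
    days = crawl_days(TARGET_PAGES, PAGES_PER_SECOND)
    used = link_utilization(PAGES_PER_SECOND, ASSUMED_PAGE_BYTES,
                            LINK_BITS_PER_SECOND)
    print(f"crawl time at current rate: {days:.1f} days")
    print(f"link utilization (assumed page size): {used:.1%}")
```

At a constant 250 pages per second the full crawl would take roughly 23 days. The utilization figure depends entirely on the assumed page size, so it should be read as an order-of-magnitude estimate only.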
The setup is: 16 maps / 16 reduces per cluster, max number of maps: 160,
reduces: 80. Each fetch map uses 16 threads for downloading.
DNS setup is OK: a 2-layer scheme with a local DNS with a 2GB cache and
back-end DNS servers such as GoogleDNS, OpenDNS and the ISP's DNS.
Config is:
*conf/core-site.xml*:
fs.inmemory.size.mb=200
*conf/hdfs-site.xml*:
dfs.replication=2
*conf/mapred-site.xml*:
mapred.map.tasks=16
mapred.reduce.tasks=16
mapred.tasktracker.map.tasks.maximum=160
mapred.tasktracker.reduce.tasks.maximum=160
mapred.child.java.opts=-Xmx2048m (default)
I tried different setups like 8 maps * 8 threads and 16 maps * 4
threads. When the actual number of threads gets over 64 or the download
speed gets over 150 pages/second, the number of exceptions grows
exponentially. The end result is usually 50000 fetched pages and 25000
exceptions.
The exception I get is "ConnectException: Operation not permitted".
The underlying OS is FreeBSD 7.3, which is able to support a large number
of concurrent connections.
Thanks
avoid merging
Posted by Patricio Galeas <pg...@yahoo.de>.
Hello,
I'm running Nutch on a single machine. In this case, is it possible to skip
the (segment) merging process?
That is: fetch -> updatedb -> invertlinks -> index -> dedup -> merge (index)
Thanks
pgaleas