
Posted to user@nutch.apache.org by Sergei Surovtsev <cy...@gmail.com> on 2010/08/14 00:52:24 UTC

Fwd: Crawl performance problem on 5 xeon machines

Hello.

I have a problem with a cluster of five 2U Xeon 5600 machines with 16GB RAM
each. I'm trying to run a relatively big crawl on it: 500M pages.
Bandwidth is 1Gbit.

I can't utilize even 10% of the channel, and I can't go above 64 threads per
machine without getting a significant number of exceptions. The best
performance I have seen is ~250 pages per second across the whole cluster.
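
As a rough sanity check on these figures, the sketch below estimates how much of a 1Gbit link 250 pages per second would occupy. The average page size of 20 KB is an assumption for illustration only, not something measured on this cluster, and the function names are hypothetical.

```python
# Rough link-utilization estimate for a crawl.
# ASSUMPTION: average fetched page size is 20 KB (not a measured value).

LINK_BITS_PER_SEC = 1_000_000_000   # 1 Gbit/s link, as stated in the post
PAGES_PER_SEC = 250                 # best observed rate, as stated in the post
ASSUMED_AVG_PAGE_BYTES = 20_000     # hypothetical average page size


def utilization(pages_per_sec, avg_page_bytes, link_bps=LINK_BITS_PER_SEC):
    """Fraction of the link used when fetching pages at the given rate."""
    return pages_per_sec * avg_page_bytes * 8 / link_bps


def pages_per_sec_for(target_utilization, avg_page_bytes, link_bps=LINK_BITS_PER_SEC):
    """Fetch rate needed to reach the given fraction of the link."""
    return target_utilization * link_bps / (avg_page_bytes * 8)


if __name__ == "__main__":
    used = utilization(PAGES_PER_SEC, ASSUMED_AVG_PAGE_BYTES)
    print(f"link utilization at {PAGES_PER_SEC} pages/s: {used:.1%}")
    needed = pages_per_sec_for(1.0, ASSUMED_AVG_PAGE_BYTES)
    print(f"pages/s needed to fill the link: {needed:.0f}")
```

Under that assumed page size, 250 pages per second comes to about 4% of the link, which is in line with the "not even 10%" observation. A different average page size would shift the estimate proportionally.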

The setup is: 16 maps / 16 reduces per cluster, max number of maps: 160,
reduces: 80. Each fetch map uses 16 threads for downloading.
DNS setup is OK: a 2-layer scheme with a local caching DNS server (2GB cache)
and back-end DNS servers such as Google DNS, OpenDNS and the ISP's DNS.

Config is:

*conf/core-site.xml*:
fs.inmemory.size.mb=200

*conf/hdfs-site.xml*:
dfs.replication=2

*conf/mapred-site.xml*:
mapred.map.tasks=16
mapred.reduce.tasks=16
mapred.tasktracker.map.tasks.maximum=160
mapred.tasktracker.reduce.tasks.maximum=160
mapred.child.java.opts=-Xmx2048m (default)

I tried different setups, such as 8 maps * 8 threads and 16 maps * 4
threads. When the actual number of threads goes over 64, or the download
speed goes over 150 pages/second, the number of exceptions grows
exponentially. The end result is usually 50000 fetched pages and 25000
exceptions.
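
For reference, the combinations mentioned in this message can be worked out as below. This is only an arithmetic sketch: it assumes that maps times threads per map gives the number of concurrent fetcher threads, and that the 25000 exceptions are counted in addition to the 50000 fetched pages. Neither assumption is confirmed in the message, and the names used are hypothetical.

```python
# Concurrent fetcher threads for each setup mentioned in the message.
# ASSUMPTION: total threads = number of fetch maps * threads per map.

SETUPS = {
    "16 maps x 16 threads": (16, 16),
    "8 maps x 8 threads": (8, 8),
    "16 maps x 4 threads": (16, 4),
}


def total_threads(maps, threads_per_map):
    """Number of fetcher threads running at once for one setup."""
    return maps * threads_per_map


def failure_share(fetched, exceptions):
    """Share of fetch attempts that ended in an exception.

    ASSUMPTION: exceptions are counted separately from fetched pages,
    so the number of attempts is fetched + exceptions.
    """
    return exceptions / (fetched + exceptions)


TOTALS = {name: total_threads(*pair) for name, pair in SETUPS.items()}

if __name__ == "__main__":
    for name, threads in TOTALS.items():
        print(f"{name}: {threads} concurrent threads")
    print(f"failure share: {failure_share(50_000, 25_000):.1%}")
```

Under these assumptions the two alternative setups both land at 64 threads, while the original 16 x 16 setup gives 256, and roughly one fetch attempt in three ends in an exception.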

The exception I get is "ConnectException: Operation not permitted".
The underlying OS is FreeBSD 7.3, which is able to support a large number
of concurrent connections.

Thanks

avoid merging

Posted by Patricio Galeas <pg...@yahoo.de>.

Hello,

I'm running Nutch on a single machine. In this case, is it possible to skip
the segment merging step?
That is: fetch -> updatedb -> invertlinks -> index -> dedup -> merge (index)

Thanks
pgaleas