Posted to user@nutch.apache.org by Ye T Thet <ye...@gmail.com> on 2013/03/11 17:45:45 UTC

Nutch 1.x crawler deployment configuration

Hi Folks,

This is quite a lengthy question. I hope someone will be patient enough to
go through it and give me some tips or share their experience.

I am seeking some advice on how to deploy the Nutch 1.x crawler for my
scenario with an actual data set. I have 5,000 hosts that I am crawling, and
I am using Amazon EC2 to crawl them.

My current approach is as follows:
- Split the hosts into groups of 500.
- Use 10 crawlers to fetch URLs from the 5,000 hosts. Each crawler fetches
  URLs from 500 hosts up to depth 10.
- Each crawler is an EC2 medium instance (*2 CPU, 3.7 GiB memory, 410 GiB
  storage*).
- Each crawler takes 10 to 14 days to complete a crawl of depth 10, producing
  on average 15 GiB of crawl data and 300k URLs, i.e. a total of 150 GiB of
  crawl data and 3 million URLs from the 5,000 hosts.
- Once the crawl is complete, I put the data into S3 for further processing.
- I use a High-Memory Quadruple Extra Large instance (*25 CPU, 67 GiB
  memory, 1,690 GiB storage*) to merge the crawlDBs, segments and indexes.
  That takes around 12 hours.
- I am doing all of this in local mode (no Hadoop cluster), so I assume I get
  no advantage from MapReduce.
The command cycle each crawler runs, and the merge step, are sketched below.
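
To make that concrete, the cycle on each crawler and the merge step look
roughly like this (a sketch from memory rather than my exact script; the
crawl/* paths, the urls/ seed directory and the Solr URL are just examples):

  # One crawler, local mode, fetcher.parse=false: fetch first, parse afterwards.
  # Depth 10 = 10 generate/fetch/parse/updatedb rounds after the initial inject.
  bin/nutch inject crawl/crawldb urls/
  for i in $(seq 1 10); do
    bin/nutch generate crawl/crawldb crawl/segments
    SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)   # newest segment
    bin/nutch fetch $SEGMENT -threads 200
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done

  # On the big instance, after pulling the crawl directories of all 10
  # crawlers back from S3 and collecting their segments under all/segments:
  bin/nutch mergesegs all/segments_merged -dir all/segments
  bin/nutch mergedb all/crawldb crawl_1/crawldb crawl_2/crawldb   # ... one per crawler
  bin/nutch invertlinks all/linkdb -dir all/segments_merged
  bin/nutch solrindex http://localhost:8983/solr all/crawldb -linkdb all/linkdb -dir all/segments_merged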

My crawlers are configured as follows:
- Java Xmx in *bin/nutch* is set to *3,200 MB*, since the physical memory of
  a crawler is 3.7 GiB (leaving about 500 MiB for other system resources).
- To emphasise again: each crawler runs in local mode, not on a cluster.
- I set fetcher.parse to false in nutch-site.xml, so the crawler fetches all
  the URLs first and then parses the fetched segments in a separate step.
- I use 200 fetcher threads (is 200 overkill?).
- For politeness I keep the Nutch defaults: 1 thread per host and a 0.5 second
  delay before the next fetch on a single host.
The relevant nutch-site.xml entries are sketched below.
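
For reference, the relevant part of my nutch-site.xml looks something like the
excerpt below (the values just mirror the setup described above, and I am
assuming a recent 1.x release where the per-host limit is called
fetcher.threads.per.queue):

  <property>
    <name>fetcher.parse</name>
    <value>false</value>
    <description>Fetch only; parsing runs as a separate step.</description>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>200</value>
    <description>Total number of fetcher threads per crawler.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
    <description>One thread per host queue, for politeness.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.5</value>
    <description>Delay in seconds between successive fetches from the same host.</description>
  </property>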

From previous discussions[1] and other entries[2] on the mailing list, I have
learned that my approach is highly *NOT* efficient. One bottleneck I observed
in my current deployment is that a crawler takes around 32 to 48 hours to
parse 3 GiB of fetched segments, roughly 100k URLs.

One improvement at the top of my mind is to use a Hadoop cluster to crawl all
5,000 hosts at once, so that I can take advantage of the MapReduce processing
model. In that case, what would be the recommended settings?
- Spec for a node (are 2 CPU, 3.7 GiB memory and 410 GiB storage enough?)
- Number of nodes (5 slaves and 1 master for a start?)
- Xmx setting in bin/nutch (not sure whether it should be 1,000 MB or close to
  the maximum physical memory)
- Should I separate or combine the fetch and parse steps?
- Number of concurrent threads for the crawl (passed as an argument to
  bin/nutch)
- Number of mappers and reducers per node (see the mapred-site.xml sketch below)
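
To make that last point concrete, this is roughly what I would try in
mapred-site.xml on each slave for the 2 CPU / 3.7 GiB nodes (Hadoop 1.x
property names; the numbers are only my guess, which is exactly what I would
like feedback on):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
    <description>One map slot per core on a 2-CPU node.</description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
    <description>A single reduce slot per node.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
    <description>Heap per task JVM, so all slots fit within 3.7 GiB.</description>
  </property>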

My goal here is to save as much as I can on the EC2 bill. :) I would love to
hear your opinions or advice on the matter, to get me pointed in the right
direction faster. :)

Thanks a lot,

Ye

[1]
http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-td4045827.html

[2]
http://lucene.472066.n3.nabble.com/Differences-between-2-1-and-1-6-td4042856.html