Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2016/11/04 21:28:10 UTC

crawling speed when polite

Can anyone point me to some good information on how to optimize crawling speed while maintaining politeness?
My current situation is that Nutch is running reliably for me on a single Hadoop node. Before bringing up additional nodes, I want to make it go reasonably fast on this one node. At the moment it is fetching fewer than 1 URL per second. It seems like it should be able to do much more than that, but it is using very little internet bandwidth and CPU time.

I originally seeded it with 6 URLs, each on a different domain. I generate topN 1000 in each round. I have set generate.max.count to 100 and fetcher.server.delay to 1.0. I do not explicitly set any number of threads.
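In case it helps, here is roughly what those two settings look like in my nutch-site.xml (everything else, including the thread counts, is left at the Nutch defaults):

<?xml version="1.0"?>
<configuration>
  <!-- cap the number of URLs per host/domain in each fetchlist -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <!-- seconds to wait between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
</configuration>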
After 10 rounds, I get the following statistics. This took about 12 hours of elapsed time.
16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0:    56949
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1:    27
16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score:  0.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score:  1.2285875E-4
16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score:  1.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    47486
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched):      6697
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone): 2424
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   38
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   202
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    129
16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations
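For scale, that is 6697 fetched pages in about 12 hours, i.e. roughly 0.15 pages per second on average.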
I use the standard crawl script, with only sizeFetchlist changed. It issues the following generate command:
/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /orgs/data/crawldb /orgs/data/segments -topN 1000 -numFetchers 1 -noFilter -adddays 30

It issues the following fetch command:
/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /orgs/data/segments/20161104110458 -noParsing -threads 50


Any suggestions would be greatly appreciated. By the way, thanks for all the help so far!