Posted to user@nutch.apache.org by Srinivasan Ramaswamy <ur...@gmail.com> on 2017/05/23 18:34:30 UTC
Local mode vs. distributed mode: which one is faster for a deep
crawl of a few domains?
Hi All
We have a few domains and we would like to crawl all pages (deep crawling)
from those domains (excluding external links).
We started with a domain that has 400 URLs and crawled it with Nutch in
both modes. Time taken for the smaller domain:
local mode = 5 minutes
distributed mode (a cluster of 3 nodes) = 2 hours
We tried the same with a domain that has > 100K URLs, and local mode still
seems to be faster. Time taken for the bigger domain:
local mode crawled 28K URLs in 4 hours
distributed mode crawled only 12K URLs in 11 hours
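To make the four timings directly comparable, they can be converted to URLs per second. This is only a rough sketch using the figures reported in this message; the names `RUNS`, `urls_per_second` and `throughput_table` are made up for illustration, and it assumes that all 400 URLs of the smaller domain were fetched in both modes.

```python
# Figures reported in this message: (mode, domain, URLs crawled, elapsed seconds).
# Assumption: all 400 URLs of the smaller domain were fetched in both modes.
RUNS = [
    ("local", "smaller", 400, 5 * 60),
    ("distributed", "smaller", 400, 2 * 3600),
    ("local", "bigger", 28_000, 4 * 3600),
    ("distributed", "bigger", 12_000, 11 * 3600),
]


def urls_per_second(urls, seconds):
    """Average crawl rate over the whole run."""
    return urls / seconds


def throughput_table(runs):
    """Map (mode, domain) to the average URLs/s for that run."""
    return {
        (mode, domain): urls_per_second(urls, seconds)
        for mode, domain, urls, seconds in runs
    }


if __name__ == "__main__":
    for (mode, domain), rate in throughput_table(RUNS).items():
        print(f"{domain:8s} {mode:12s} {rate:6.2f} URLs/s")
```

On these numbers local mode averages roughly 1.3 to 1.9 URLs/s, while distributed mode stays between about 0.06 and 0.3 URLs/s.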
When I looked at the information printed to the console, I saw that
distributed mode runs a MapReduce job for every step in each iteration. It
looks to me like running these MapReduce jobs for a not-so-large number of
URLs is what slows things down.
Here is some of the configuration:
db.ignore.external.links=true
fetcher.server.delay=0.1
fetcher.queue.mode=byHost
Smaller domain:
fetcher.threads.fetch=100
fetcher.threads.per.queue=100
Bigger domain (we wanted to see whether the number of threads makes a
difference):
fetcher.threads.fetch=400
fetcher.threads.per.queue=200
The performance looks surprisingly slow. Are we missing something? Any
suggestions would be really appreciated.
Thanks
Srini