Posted to user@nutch.apache.org by Srinivasan Ramaswamy <ur...@gmail.com> on 2017/05/23 18:34:30 UTC

Local mode vs distributed mode: which one is faster for a deep crawl of a few domains?

Hi All

We have a few domains and we would like to crawl all pages (deep crawling)
from those domains (excluding external links).

We started with a domain that has 400 URLs and crawled it using Nutch.
Here is the time taken in each mode for the smaller domain:
local mode  = 5 minutes
distributed mode (a cluster of 3 nodes) = 2 hours

We tried the same with a domain that has > 100K URLs, and local mode still
seems to be faster. Time taken for the bigger domain:

local mode crawled 28K URLs in 4 hours
distributed mode crawled only 12K URLs in 11 hours
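
To put those timings on a common scale, here is a rough Python sketch that
turns them into URLs per hour. The function name is made up for illustration,
28K and 12K are taken as round numbers, and for the smaller domain it assumes
all 400 URLs were fetched in both runs.

```python
def urls_per_hour(urls, hours):
    """Average crawl throughput in URLs per hour."""
    return urls / hours


# Bigger domain: figures from the two runs (28K and 12K taken as round numbers).
local_big = urls_per_hour(28_000, 4)
distributed_big = urls_per_hour(12_000, 11)

# Smaller domain: ASSUMES all 400 URLs were fetched in both runs.
local_small = urls_per_hour(400, 5 / 60)   # 5 minutes expressed in hours
distributed_small = urls_per_hour(400, 2)

for label, local, distributed in [
    ("bigger domain", local_big, distributed_big),
    ("smaller domain", local_small, distributed_small),
]:
    print(
        f"{label}: local {local:.0f} URLs/h, "
        f"distributed {distributed:.0f} URLs/h, "
        f"local is {local / distributed:.1f}x faster"
    )
```

By this estimate, local mode is several times faster on the bigger domain and
more than an order of magnitude faster on the smaller one.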

When I looked at the console output, I saw that in distributed mode Nutch
runs a MapReduce job for every step in each iteration. It looks to me like
the overhead of these MapReduce jobs, for a not-so-big number of URLs, is
slowing things down.

Here is some of our configuration:

 db.ignore.external.links=true
 fetcher.server.delay=0.1
 fetcher.queue.mode=byHost

smaller domain
 fetcher.threads.fetch=100
 fetcher.threads.per.queue=100

bigger domain (as we wanted to see whether the number of threads makes a
difference)
 fetcher.threads.fetch=400
 fetcher.threads.per.queue=200

The performance looks surprisingly slow. Are we missing something? Any
suggestions would be really appreciated.


Thanks
Srini