Posted to user@nutch.apache.org by Vladimir Loubenski <vl...@opentext.com> on 2016/10/05 18:09:47 UTC

Nutch scalability

Hi,
I have a Nutch 2.3.1 installation with MongoDB.

I want to understand what scalability options I have.

1. The number of threads used during one job can be set in nutch-site.xml:
	a. fetcher.threads.per.queue - The maximum number of threads that are allowed to access a queue at one time.
	b. fetcher.threads.fetch - The number of FetcherThreads the fetcher should use.
Are there other configuration parameters that affect scalability?

2. Ability to run the same job on different hosts.
	Is this supported by Nutch?
3. Ability to run jobs in parallel.
	Example: I run a “fetch” job. It produces new URLs that have not been crawled yet.
	Can I run another job to process these uncrawled URLs before the first job is done?
4. Database scalability.
	Can I use multiple MongoDB instances for crawling?
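
For reference, the two thread settings from point 1 would go into nutch-site.xml roughly as follows. This is only a sketch: the property names are the ones listed above, and the values are illustrative placeholders, not recommendations.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Total number of FetcherThreads used by one fetch job.
       The value 50 is an assumed example, not a recommended setting. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- Maximum number of threads allowed to access one queue at a time.
       The value 2 is an assumed example, not a recommended setting. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
</configuration>
```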

Thank you in advance,
Vladimir.