Posted to user@nutch.apache.org by Terry Pothecary <te...@pothecary.com> on 2006/04/11 20:02:12 UTC

Crawling a large, finite set of sites.

Hi. I'm a relative novice with Nutch. I have a custom architecture that 
I am finding difficult to support:

I would like someone to explain some of the basics of Nutch
operation so that I can come up with a better solution than the one I have.

I am using Nutch to crawl a specific set of 500,000 named sites.
Each site has a set of tags that have to be included as fields when its 
pages are indexed by Lucene.
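
For concreteness, the way I imagine attaching those tags is a custom
indexing-filter plugin along the lines of the sketch below. The filter()
signature here follows the 0.8-era IndexingFilter interface and differs in
other Nutch versions; the class name, the siteTags lookup table, and the
"sitetag" field name are placeholders of mine, not Nutch names.

// Sketch: an indexing filter that adds per-site tags as Lucene fields.
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class SiteTagIndexingFilter implements IndexingFilter {

  private Configuration conf;
  // host -> tags; how this table gets populated is my problem, omitted here
  private Map<String, String[]> siteTags = new HashMap<String, String[]>();

  public Document filter(Document doc, Parse parse, UTF8 url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    try {
      String host = new URL(url.toString()).getHost();
      String[] tags = siteTags.get(host);
      if (tags != null) {
        for (String tag : tags) {
          // stored + untokenized so the tag comes back with search hits
          doc.add(new Field("sitetag", tag,
                            Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
      }
    } catch (MalformedURLException e) {
      // leave the document unchanged if the URL will not parse
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The plugin would then be enabled through plugin.includes alongside the
stock index-* filters.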

So when I seed the crawl tool with all the URLs, it takes forever to run
and then forever to index.
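
For reference, by "the crawl tool" I mean the one-shot command from the
tutorial, something like

    bin/nutch crawl urls -dir crawl -depth 3

where urls is the flat file holding all 500,000 seed URLs (the directory
name and the depth are just placeholder values).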

I would like some help to create a stable, continuously running system
that I can tweak by occasionally adding or removing URLs. I also need the
index-and-use cycle to run every 24 hours. Initially the content of the
crawl database will be somewhat sparse, but over time it will fill up
with successive depths of the 500,000 seed sites.
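
To make that goal concrete, here is a rough sketch of the daily cycle I
have in mind, driving the step-by-step tools instead of the one-shot crawl
command. The subcommand names and argument order follow the 0.8-era
whole-web tutorial and may differ in other versions; every path and the
topN figure are placeholders of mine, not Nutch defaults.

// Sketch: one generate/fetch/updatedb/index pass, meant to run daily.
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class DailyCrawlCycle {

  // run one bin/nutch subcommand and wait for it, echoing its output
  static void nutch(String... args) throws IOException, InterruptedException {
    String[] cmd = new String[args.length + 1];
    cmd[0] = "bin/nutch";
    System.arraycopy(args, 0, cmd, 1, args.length);
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0)
      throw new IOException("nutch " + args[0] + " failed");
  }

  public static void main(String[] argv) throws Exception {
    // intended to run once per 24 hours, e.g. from cron
    nutch("inject", "crawl/crawldb", "urls");   // merge in any new seed URLs
    nutch("generate", "crawl/crawldb", "crawl/segments",
          "-topN", "100000");                   // cap one day's fetch list

    // generate names the new segment itself; take the newest directory,
    // the same way the tutorial's `ls ... | tail -1` does
    File[] segs = new File("crawl/segments").listFiles();
    Arrays.sort(segs);
    String segment = segs[segs.length - 1].getPath();

    nutch("fetch", segment);                        // fetch today's pages
    nutch("updatedb", "crawl/crawldb", segment);    // fold results into the db
    nutch("invertlinks", "crawl/linkdb", segment);  // refresh the link db
    nutch("index", "crawl/indexes", "crawl/crawldb",
          "crawl/linkdb", segment);                 // index the new segment
  }
}

The idea is that each daily pass pushes the crawl one level deeper into
the 500,000 seed sites, which is the fill-up behaviour I described above.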

Please ask me any further questions you need in order to clarify this
situation; I'm not sure right now what information is relevant to your
understanding.


Thanks in advance.
David.