Posted to user@nutch.apache.org by Terry Pothecary <te...@pothecary.com> on 2006/04/11 20:02:12 UTC
Crawling a large, finite set of sites.
Hi. I'm a relative novice with Nutch, and I have a custom architecture
that I am finding difficult to support.
I would like someone to explain some of the basics of Nutch operation
so that I can come up with a better solution than the one I have.
I am using Nutch to crawl a specific set of 500,000 named sites.
Each site has a set of tags that have to be included as fields when its
pages are indexed by Lucene.
So when I seed the crawl tool with all the URLs, it takes forever to run
and then forever to index.
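(For context, the one-step crawl referred to above is roughly the
invocation below; the directory names, depth, and flags here are
illustrative and vary between Nutch versions.)

```shell
# One-step crawl: read seed URLs from the 'urls' directory, crawl
# everything to the given depth, then index the whole result.
# With 500,000 seeds this runs as a single, very long batch job.
bin/nutch crawl urls -dir crawl.test -depth 3
```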
I would like some help to create a stable, continuously running system
that I can tweak by occasionally adding / removing URLs. I also need the
index-and-use cycle to be every 24 hours. Initially the content of the
crawled database will be somewhat sparse but over time it will fill up
with successive depths of the 500,000 seed sites.
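(A continuously running setup like the one described usually means
moving from the one-step crawl tool to Nutch's individual whole-web
commands, run as a repeating cycle. Below is a minimal sketch of one
such daily cycle; the command names and arguments are approximate and
differ between Nutch versions, so treat it as an outline rather than a
working script. The -topN value and directory names are made up for
illustration.)

```shell
#!/bin/sh
# Hypothetical daily cycle: fetch one slice of the 500,000 sites,
# fold the results back into the database, and index the new segment
# so content becomes searchable within the 24-hour window.
DB=db
SEGMENTS=segments

# 1. Select the next batch of due URLs and create a new segment.
bin/nutch generate $DB $SEGMENTS -topN 100000
SEGMENT=$SEGMENTS/`ls -t $SEGMENTS | head -1`

# 2. Fetch the pages listed in that segment.
bin/nutch fetch $SEGMENT

# 3. Update the database with the fetch results; newly discovered
#    links deepen the crawl on subsequent days.
bin/nutch updatedb $DB $SEGMENT

# 4. Index the new segment.
bin/nutch index $SEGMENT
```

Adding or removing seed sites then becomes a matter of injecting new
URLs into the database (or filtering out unwanted ones) between cycles,
rather than re-running the whole crawl from scratch.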
Please ask me any questions you need in order to clarify this
situation; I'm not sure right now what information is relevant to your
understanding.
Thanks in advance.
David.