Posted to user@nutch.apache.org by Hong Li <ce...@gmail.com> on 2006/03/18 06:22:44 UTC

Two questions about Nutch usage

Hi all,

I've been using Nutch 0.7.1 with Tomcat 4.1.13 to index our own website. It's
quite amazing to see Nutch use UTF-8 to support Chinese web pages so well that
no extra configuration is needed! Here are two questions I'm currently facing.
I'd appreciate any feedback, since my search of the mailing list archives
didn't turn up a related answer.


1. Every day our website publishes about 1000 new web pages. After using the
command:

 bin/nutch crawl abc.com/nutch.url -dir abc.com/crawl

to fetch and index all our existing web pages, how can I set up Nutch in
crontab to automatically fetch newly added pages? I've found that I can only
run the above command once: a second run fails with an error message saying
the crawl directory already exists, unless I delete the crawl directory
first.

2. Every web page has several information sections that can be identified
by <div id=xxx>..</div>. Is it possible to set up the Nutch indexer to index
only the content within certain <div></div> elements, or content matching
some other pattern? For example, a web page may have 10000 characters, but I
only want to index the 1000 characters in the middle of the page that hold
the most meaningful content.
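
To make question 2 concrete, here is a minimal Python sketch of the kind of
filtering being asked about: keep only the text inside one <div> with a given
id and discard the rest of the page. This is not Nutch code and does not use
any Nutch API (Nutch itself is written in Java); it only illustrates the
desired behavior. The names DivTextExtractor and extract_div_text and the id
"main" are hypothetical, chosen for the example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside <div id=target_id> ... </div>."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0    # div nesting depth; 0 means outside the target div
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        # Only <div> tags affect nesting, so unclosed <p> or <li> tags
        # inside the section do not confuse the depth counter.
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, target_id):
    """Return the whitespace-normalized text of the div with the given id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    # Fragments are joined with spaces so adjacent blocks do not run
    # together; this can also split a word that spans inline tags.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    # Hypothetical page with navigation, main content, and a footer.
    page = (
        "<html><body>"
        "<div id=nav>Home | News</div>"
        "<div id=main><p>Meaningful <b>content</b> here.</div>"
        "<div id=foot>Copyright</div>"
        "</body></html>"
    )
    print(extract_div_text(page, "main"))
```

Only the text returned by extract_div_text would be handed to the indexer
in this picture; the navigation and footer sections are dropped. Whether
and how Nutch can be configured to do the equivalent is exactly what the
question asks.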


Any advice is appreciated,

Li