Posted to user@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/08/10 16:09:00 UTC

Setting the url filter on demand, crawling just a certain domain which will be defined at runtime

Hi,

my last big question is:

My understanding of crawling is that I can
define the crawl space so that the
crawler only visits and indexes pages
of a certain domain:

For example, +^http://([a-z0-9]*\.)*apache.org/
can be put into crawl-urlfilter.txt.
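Just to illustrate what such a rule matches, here is a small sketch in plain Java (the class name is made up, and I escape the dots, which the sample rule above leaves unescaped so they match any character):

```java
import java.util.regex.Pattern;

public class FilterDemo {
    public static void main(String[] args) {
        // The regex part of the crawl-urlfilter.txt rule (without the leading '+').
        // Dots are escaped here; in the rule above they are unescaped wildcards.
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*apache\\.org/");

        // A subdomain of apache.org passes the filter ...
        System.out.println(p.matcher("http://lucene.apache.org/nutch/").find()); // true
        // ... while a foreign host does not.
        System.out.println(p.matcher("http://www.example.org/").find());         // false
    }
}
```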

But this is defined before runtime.

In my application, a user can add a
URL like www.ifis.uni-luebeck.de,
and the application then crawls, indexes, and graphs that domain.

The problem is that when I crawl to some depth,
other sites like www.uni-luebeck.de
and www.xyz.de are crawled as well and appear in
the graph.

But I just want a sitemap
of URLs with .ifis.uni-luebeck.de as the domain.

Adding
+^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/
to crawl-urlfilter.txt helps,
but only for this one URL.

The URLs are only known at runtime, so
there's the problem.

How can this be solved?
What can I do at runtime so that
only pages of the same domain are
crawled and put into the DB?
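One workaround I am considering is to generate the filter line from the user's URL at runtime and write it into crawl-urlfilter.txt before each crawl. A rough sketch (FilterRuleBuilder and ruleFor are names I made up, not a Nutch API; the host parsing is simplistic):

```java
public class FilterRuleBuilder {
    // Build a crawl-urlfilter rule from a user-supplied URL so that
    // the host and all of its subdomains pass the filter.
    static String ruleFor(String url) {
        String host = url.replaceFirst("^https?://", "");  // drop the scheme
        int slash = host.indexOf('/');
        if (slash >= 0) host = host.substring(0, slash);   // drop any path
        String escaped = host.replace(".", "\\.");         // escape regex dots
        return "+^http://([a-z0-9]*\\.)*" + escaped + "/";
    }

    public static void main(String[] args) {
        System.out.println(ruleFor("http://ifis.uni-luebeck.de/index.html"));
        // prints: +^http://([a-z0-9]*\.)*ifis\.uni-luebeck\.de/
    }
}
```

The generated line would then replace the per-domain entry in crawl-urlfilter.txt before the crawl is started. But is rewriting the filter file the intended way, or is there a hook to set the filter programmatically?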

Thanks for your help

Nils