Posted to user@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/08/10 16:09:00 UTC
Setting the url filter on demand, crawling just a certain domain
which will be defined at runtime
Hi,
my last big question is:
I understand crawling to work in such a way that I can
define the crawl space, so that the
crawler only visits and indexes sites
of a certain domain:
for example, +^http://([a-z0-9]*\.)*apache.org/
can be put into crawl-urlfilter.txt.
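To show what I mean by that rule (just a sketch with plain java.util.regex to illustrate the pattern itself, not Nutch's actual RegexURLFilter plugin):

```java
import java.util.regex.Pattern;

public class MatchDemo {
    public static void main(String[] args) {
        // The filter line means: accept URLs whose host is apache.org
        // or any subdomain of it ("lucene.apache.org" etc.).
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");
        System.out.println(p.matcher("http://lucene.apache.org/nutch/").find()); // true
        System.out.println(p.matcher("http://www.example.org/").find());         // false
    }
}
```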
But this is defined before runtime.
In my application, a user can add a
URL like www.ifis.uni-luebeck.de
and then crawl, index, and graph that domain.
The problem is that when I crawl to some depth,
other sites like www.uni-luebeck.de
and www.xyz.de are also crawled and appear in
the graph.
But I just want to have a sitemap
of urls with .ifis.uni-luebeck.de as the domain.
Adding
+^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/
to crawl-urlfilter.txt helps,
but only for this one URL.
The URLs are only known at runtime, so
that's the problem.
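What I would like is something that turns the user's URL into such a filter line automatically. A minimal sketch of the string construction (filterFor is just a name I made up; how to feed the generated line back into Nutch's configuration is exactly the open question; note I also escape the dots here, which strictly a hand-written rule should do too):

```java
// Sketch: turn a host entered at runtime into a crawl-urlfilter line.
public class FilterLine {
    static String filterFor(String host) {
        // drop a leading "www." so the rule covers the bare domain
        // and all of its subdomains
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        // escape dots so they match literally instead of "any character"
        String escaped = host.replace(".", "\\.");
        return "+^http://([a-z0-9]*\\.)*" + escaped + "/";
    }

    public static void main(String[] args) {
        System.out.println(filterFor("www.ifis.uni-luebeck.de"));
        // prints: +^http://([a-z0-9]*\.)*ifis\.uni-luebeck\.de/
    }
}
```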
How can this be solved?
What can I do at runtime, so that
only subpages of the same domain will
be crawled and put into the db?
Thanks for your help
Nils