Posted to user@nutch.apache.org by Cesar Voulgaris <ce...@gmail.com> on 2007/06/11 03:24:14 UTC
crawling by ip range
Hi all, I have had a problem for some time: I want to crawl only sites from my
country or related to it. The problem is
that crawling only by domain (in my case I set the regex-urlfilter regex to
catch "(com|org|..).uy") leaves out a lot of sites which don't end in .uy but
in .com, .org, .... I don't want to crawl to a certain depth and expand the
crawled pages outside the country. Is there any clever method to crawl over a
range of IPs
without touching the code? If not, which plugin or extension point do I have
to extend to add something like IP checking for a given URL?
Thanks in advance
Re: crawling by ip range
Posted by Enzo Michelangeli <en...@gmail.com>.
I have written a custom URLFilter that resolves the hostname into an IP
address and checks the latter against a GeoIP database. Unfortunately the
source code was developed under a commercial contract, and is not freely
available.
Enzo
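
For readers who want a starting point, here is a minimal sketch of the idea described in the reply: resolve the URL's hostname to an IP address and accept the URL only if that address falls inside an allowed range. It is not the code mentioned in the reply, which is not available. As an assumption for illustration, the GeoIP database lookup is replaced by a fixed list of IPv4 CIDR blocks, and the class name IpRangeUrlFilter and the example blocks (reserved documentation ranges) are hypothetical. The filter(String) method mirrors the "return the URL to keep it, null to drop it" contract of Nutch's URLFilter extension point, but the class is standalone and does not depend on Nutch.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: keeps a URL only if its host resolves to an IPv4
 * address inside one of the configured CIDR blocks. A fixed block list
 * stands in for a GeoIP database (assumption, not the original code).
 */
class IpRangeUrlFilter {

    /** One IPv4 CIDR block, stored as a masked network address plus mask. */
    private static final class Cidr {
        final int network;
        final int mask;

        Cidr(String cidr) throws UnknownHostException {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            // Java shifts are taken modulo 32, so a /0 prefix needs its own case.
            this.mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            // An IP literal is parsed directly; no DNS lookup happens here.
            this.network = toInt(InetAddress.getByName(parts[0]).getAddress()) & mask;
        }

        boolean contains(int address) {
            return (address & mask) == network;
        }
    }

    private final List<Cidr> allowed = new ArrayList<>();

    IpRangeUrlFilter(List<String> cidrBlocks) throws UnknownHostException {
        for (String block : cidrBlocks) {
            allowed.add(new Cidr(block));
        }
    }

    /** Returns the URL unchanged if its host is in an allowed block, else null. */
    String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return null;
            }
            // Resolves the hostname (a DNS lookup unless the host is an IP literal).
            byte[] raw = InetAddress.getByName(host).getAddress();
            if (raw.length != 4) {
                return null; // IPv6 is not handled in this sketch
            }
            int address = toInt(raw);
            for (Cidr block : allowed) {
                if (block.contains(address)) {
                    return urlString;
                }
            }
            return null;
        } catch (URISyntaxException | UnknownHostException e) {
            return null; // malformed URL or unresolvable host: reject
        }
    }

    /** Packs four address bytes into a big-endian int. */
    private static int toInt(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}
```

Constructed with, say, List.of("192.0.2.0/24"), the filter keeps http://192.0.2.10/page and drops http://198.51.100.7/. Inside Nutch, a class like this would presumably implement org.apache.nutch.net.URLFilter and be registered as a plugin. Note that one DNS lookup per candidate URL can be slow on a large crawl, so caching resolved hosts is probably worth considering.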