Posted to user@nutch.apache.org by Cesar Voulgaris <ce...@gmail.com> on 2007/06/11 03:24:14 UTC

crawling by ip range

Hi all, I have had a problem for some time: I want to crawl only sites from my
country or related to it. The problem is that crawling only by domain (in my
case I set the regex-urlfilter regex to catch "(com|org|..).uy") leaves out a
lot of sites which don't end in .uy but in .com, .org, .... I don't want to
crawl to a certain depth and have the crawled pages expand outside the
country. Is there any clever method to crawl over a range of IPs without
touching the code? If not, which plugin or extension point do I have to
extend to add something like IP checking for a given URL?

thanks in advance

Re: crawling by ip range

Posted by Enzo Michelangeli <en...@gmail.com>.
I have written a custom URLFilter that resolves the hostname into an IP 
address and checks the latter against a GeoIP database. Unfortunately the 
source code was developed under a commercial contract, and is not freely 
available.

Enzo
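
For readers who want a starting point, here is a minimal sketch of the idea
described above. It is not the code from that contract. It assumes a plain
list of IPv4 CIDR ranges in place of a GeoIP database, and the class name
IpRangeUrlFilter is hypothetical. The filter(String) method follows the
convention of Nutch's URLFilter extension point (return the URL to keep it,
null to drop it), but the class is standalone and the plugin wiring is left
out.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch: keeps a URL only if its host resolves to
// an IPv4 address inside one of the configured CIDR ranges.
class IpRangeUrlFilter {
    // Each entry is {network, mask}, both stored as 32-bit ints.
    private final List<int[]> ranges = new ArrayList<>();

    IpRangeUrlFilter(String... cidrs) {
        for (String cidr : cidrs) {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            // A shift by 32 is a no-op in Java, so /0 is handled separately.
            int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            try {
                byte[] base = InetAddress.getByName(parts[0]).getAddress();
                if (base.length != 4) {
                    throw new IllegalArgumentException("IPv4 only: " + cidr);
                }
                ranges.add(new int[] {toInt(base) & mask, mask});
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("Bad CIDR: " + cidr, e);
            }
        }
    }

    // Packs four address bytes into one int, most significant byte first.
    static int toInt(byte[] b) {
        return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
             | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
    }

    // Returns the URL unchanged if it passes, or null if it is rejected.
    String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return null; // no usable host part
            }
            // Performs a DNS lookup unless the host is an IP literal.
            byte[] addr = InetAddress.getByName(host).getAddress();
            if (addr.length != 4) {
                return null; // this sketch ignores IPv6
            }
            int ip = toInt(addr);
            for (int[] r : ranges) {
                if ((ip & r[1]) == r[0]) {
                    return urlString;
                }
            }
            return null;
        } catch (Exception e) {
            return null; // malformed URL or unresolvable host
        }
    }
}
```

A filter like this does one DNS lookup per URL, so a real crawl would probably
want a per-host cache in front of it. It also checks only the first address
the resolver returns, so hosts with several addresses may be judged on just
one of them.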
