Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/07/31 03:26:37 UTC

Re: how to exclude some external links

On Thu, Jul 30, 2009 at 9:15 PM, <al...@aim.com> wrote:

> I would like to know how I can modify the Nutch code to exclude external links with certain extensions. For example, if my urls file contains mydomain.com, and mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want Nutch not to fetch (crawl) those kinds of URLs at all.

Can't you do this with the existing RegexURLFilter plugin?  Make sure
urlfilter-regex is listed in plugin.includes, and that the property
urlfilter.regex.file points to a file (probably regex-urlfilter.txt).
Then you can list the extensions you want to skip in that file.
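
For instance, here is a minimal sketch assuming the default file names
(the .shtml pattern is just an illustration):

In conf/nutch-site.xml:

  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>

In conf/regex-urlfilter.txt (rules are checked top to bottom and the
first matching rule wins):

  # skip URLs ending in .shtml, case-insensitively
  -(?i)\.shtml$
  # accept everything else
  +.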

-- 
http://www.linkedin.com/in/paultomblin

Re: how to exclude some external links

Posted by al...@aim.com.

Hi,

The plugin is enabled in the nutch-default.xml file, but changes to regex-urlfilter.txt did not affect what was fetched. Instead, changes to crawl-urlfilter.txt did change which links were fetched.
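
This is consistent with using the one-step bin/nutch crawl command: in
Nutch 0.9/1.0 the crawl tool loads crawl-tool.xml, which points
urlfilter.regex.file at crawl-urlfilter.txt instead of
regex-urlfilter.txt, so exclusion patterns have to go in
crawl-urlfilter.txt. A minimal sketch of conf/crawl-urlfilter.txt
(mydomain.com is a stand-in for the real domain; the first matching
rule wins):

  # skip URLs ending in .shtml, case-insensitively
  -(?i)\.shtml$
  # accept everything within mydomain.com
  +^http://([a-z0-9]*\.)*mydomain.com/
  # skip everything else
  -.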

Thanks.
Alex.