Posted to dev@nutch.apache.org by Christophe Noel <ch...@gmail.com> on 2005/08/05 10:57:00 UTC

Ignore external links from crawled domains

Hello,

A very basic facility seems to be missing in Nutch. If I have a list of 
2000 URLs in the Nutch DB and want to ignore external links, I have to 
build a regex filter listing the thousands of domains I want to crawl. 
There is no parameter to crawl only those domains and ignore external links.

For the time being, is there another solution? Has anybody worked on this?

Thank you very much.

Christophe Noël.



Re: Ignore external links from crawled domains

Posted by Ken Krugler <kk...@transpac.com>.
>A very basic facility seems to be missing in Nutch. If I have a list 
>of 2000 URLs in the Nutch DB and want to ignore external links, I have 
>to build a regex filter listing the thousands of domains I want to 
>crawl. There is no parameter to crawl only those domains and ignore 
>external links.
>
>For the time being, is there another solution? Has anybody worked on this?

We did something similar, though not exactly the same.

We've got a list of "favored domains", and we use this to boost link 
scores in the FetchListTool before sorting and selecting the topN. So 
you could easily apply the same approach to strip out any URLs that 
aren't in your domain set.
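
A minimal sketch of that stripping step, assuming the domain set is 
simply the set of hosts found in the seed URL list. SeedHostFilter and 
its methods are hypothetical names for illustration, not Nutch classes, 
and the sketch does not show where FetchListTool would call it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only URLs whose host
// exactly matches a host taken from the seed list.
final class SeedHostFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    SeedHostFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed
    // or has no host component.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True only if the URL's host is one of the seed hosts.
    boolean accepts(String url) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }

    // Returns the candidates that pass accepts(), in their original order.
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (accepts(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Matching here is on the exact host, so www.example.com and example.com 
count as different; a real domain set would probably need some 
normalization on top of this.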

Another approach that I haven't tried would be to set the external 
link weight (db.score.link.external) to 0, so any new page added by a 
link that's "leaving" a domain effectively gets a score of 0. Two 
problems I can think of are (a) a link between pages from two of your 
target domains would presumably also count as external and get the 
same weight of 0, and (b) without mods to FetchListTool you still 
might wind up fetching a page with a score of 0.
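
To illustrate the kind of mod that problem (b) would need, here is a 
rough sketch of a topN selection that skips zero-score entries before 
sorting. TopNSelector and Candidate are hypothetical names made up for 
this example; this is not the actual FetchListTool code, and it 
assumes each candidate URL already carries its score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not the FetchListTool implementation.
final class TopNSelector {
    static final class Candidate {
        final String url;
        final float score;

        Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Returns up to topN URLs ordered by descending score,
    // dropping any candidate whose score is 0 or less.
    static List<String> select(List<Candidate> candidates, int topN) {
        List<Candidate> eligible = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.score > 0f) {
                eligible.add(c);
            }
        }
        // Highest score first.
        eligible.sort((a, b) -> Float.compare(b.score, a.score));

        List<String> picked = new ArrayList<>();
        for (int i = 0; i < eligible.size() && i < topN; i++) {
            picked.add(eligible.get(i).url);
        }
        return picked;
    }
}
```

Without the score check, a zero-score page can still be selected 
whenever fewer than topN pages have a positive score.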

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200