Posted to agent@nutch.apache.org by Pierre-Luc Bacon <pi...@aqra.ca> on 2007/02/13 06:22:08 UTC

url filters

I wish to use Nutch so that it crawls the urls contained in a file
(say urls/urls.txt) but stays only within those. I have been using
Nutch for a few weeks now, but it bothers me to see that the crawler
goes off visiting the ads on websites and indexes their content. Most
of the time, the crawler ends up analysing content from "free ipod,
discount stuff and traveltoBananaIsland.com" related sites, which I'm
not at all interested in having in the index.

I know that conf/crawl-urlfilter.txt could be used for that purpose,
but I was wondering if there is a single line in a conf file that
would turn such a feature on. I would prefer to avoid writing regexps
and just feed the crawler plain urls.
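For reference, conf/crawl-urlfilter.txt takes one rule per line: a
'+' or '-' prefix (accept or reject) followed by a regex, evaluated
top to bottom with the first match winning. A minimal whitelist that
keeps the crawl on a couple of seed hosts might look roughly like the
sketch below (example.com and example.org are placeholders, not taken
from this thread):

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip image and other binary suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|zip|gz|exe)$
# accept anything under the seed hosts
+^http://([a-z0-9\-]*\.)*example\.com/
+^http://([a-z0-9\-]*\.)*example\.org/
# reject everything else
-.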

Re: url filters

Posted by John Whelan <jo...@whelanlabs.com>.
Filtering would be one solution: you would set your filter criteria to
match your pages. Another approach is to set the traversal depth so that
only the primary pages (those listed in your urls.txt file) are fetched,
and nothing deeper is crawled.
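
If the seed pages themselves are all that is needed, the one-shot crawl
command takes a -depth option; with -depth 1 only the injected urls are
fetched and no outlinks are followed. A rough sketch (the urls and crawl
directory names are placeholders):

bin/nutch crawl urls -dir crawl -depth 1 -topN 1000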



Pierre-Luc Bacon wrote:
> 
> I wish to use Nutch so that it crawls the urls contained in a file
> (say urls/urls.txt) but stays only within those. I have been using
> Nutch for a few weeks now, but it bothers me to see that the crawler
> goes off visiting the ads on websites and indexes their content. Most
> of the time, the crawler ends up analysing content from "free ipod,
> discount stuff and traveltoBananaIsland.com" related sites, which I'm
> not at all interested in having in the index.
> 
> I know that conf/crawl-urlfilter.txt could be used for that purpose,
> but I was wondering if there is a single line in a conf file that
> would turn such a feature on. I would prefer to avoid writing regexps
> and just feed the crawler plain urls.
> 
> 

-- 
View this message in context: http://www.nabble.com/url-filters-tp8938763p22761674.html
Sent from the Nutch - Agent mailing list archive at Nabble.com.

