Posted to user@nutch.apache.org by Manish Verma <m_...@apple.com> on 2016/01/28 01:14:38 UTC

Filter Urls Only At Generation Time Or Fetch Time

Hi,

I am using Nutch 1.10, and we are planning to crawl only URLs that match a certain pattern.
The problem is that we cannot do this with regex-urlfilter.txt, because the seeds themselves would then be rejected.

For example, the seed is apple.com <http://apple.com/> and we want to crawl only URLs that contain /mac/ in the URL string. Maybe we have to filter the URLs at generate or fetch time.
Any thoughts? Can we customize the Generate or Fetch phases?
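
To make the rule concrete, here is a minimal sketch in plain Java of the behavior we are after. It is not a Nutch plugin and is not wired into the Generate or Fetch phases. The class name MacPathFilter and the hard-coded SEEDS set are hypothetical, used only for illustration. It follows the same convention as Nutch's URLFilter interface: return the URL to keep it, return null to drop it.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration only: not part of Nutch and not a working plugin.
final class MacPathFilter {

    // Assumption: the seed list is known up front; here it is hard-coded.
    private static final Set<String> SEEDS =
            Collections.singleton("http://apple.com/");

    // Returns the URL if it should be kept, or null if it should be dropped.
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        // Seeds always pass, so the crawl can start from them.
        if (SEEDS.contains(url)) {
            return url;
        }
        // Everything else must contain /mac/ somewhere in the URL string.
        return url.contains("/mac/") ? url : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://apple.com/",
            "http://apple.com/mac/",
            "http://apple.com/iphone/"
        };
        for (String u : samples) {
            System.out.println(u + " -> " + (filter(u) != null ? "keep" : "drop"));
        }
    }
}
```

The seed passes even though it has no /mac/ in it, which is exactly the case a single pattern in regex-urlfilter.txt does not cover for us.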

Thanks
Manish Verma