Posted to user@nutch.apache.org by al...@aim.com on 2011/12/02 02:24:01 UTC

how to filter outlinks

Hello,

I wonder whether it is possible to filter outlinks based on a regex. For example, my seed list has domains with all kinds of extensions, such as .com, .net, .org, etc. Is it possible to put a filter on the outlinks of the seed URLs so that domains with a .net extension are not fetched, without excluding the .net domains that are in the seed list itself?
I could use the urlfilter-domain plugin, but as far as I understand, it will apply the regex to the domains in the original seed file and to their inlinks.
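
To make the behaviour I am after concrete, here is a minimal sketch in plain Java. It does not use any Nutch API; the class name OutlinkFilterSketch, the method accept, the .net pattern, and the example hosts are all made up for illustration. The idea is that a URL whose host is in the seed list is always kept, and any other URL is dropped if its host matches the regex.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not a Nutch plugin or Nutch API.
class OutlinkFilterSketch {
    // Assumed rule: block hosts that end in ".net".
    private static final Pattern BLOCKED_HOST =
            Pattern.compile(".*\\.net$", Pattern.CASE_INSENSITIVE);

    // Returns true if the URL should be kept, false if it should be dropped.
    static boolean accept(String url, Set<String> seedHosts) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: drop it
        }
        if (host == null) {
            return false; // no host component: drop it
        }
        host = host.toLowerCase(Locale.ROOT);
        // Hosts from the seed list are never excluded.
        if (seedHosts.contains(host)) {
            return true;
        }
        // Any other host is dropped if it matches the blocked pattern.
        return !BLOCKED_HOST.matcher(host).matches();
    }

    public static void main(String[] args) {
        Set<String> seeds = Set.of("example.net", "example.com");
        System.out.println(accept("http://example.net/page", seeds)); // true: seed host
        System.out.println(accept("http://other.net/", seeds));       // false: non-seed .net
        System.out.println(accept("http://other.org/", seeds));       // true: not blocked
    }
}
```

What I do not know is whether an existing plugin or configuration can already express this kind of "seed hosts are exempt" rule.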

Thanks.
Alex.