Posted to user@nutch.apache.org by suraj shrestha <su...@yahoo.com> on 2011/09/26 00:21:13 UTC

How to disable PDF crawling but show PDF links as outlinks

Right now, I am using regex-urlfilter.txt to disable PDF crawling. However, I still want to be able to see the PDF links when I read the link db (bin/nutch readlinkdb).
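For context, an exclusion rule of the kind I mean looks roughly like the following. This is a sketch of the usual regex-urlfilter.txt syntax and an assumption about the pattern, not a copy of my actual file:

```
# regex-urlfilter.txt: a leading '-' rejects matching URLs, a leading '+' accepts them.
# The first matching rule wins, so the exclusion has to come before the catch-all.

# Assumed pattern: skip any URL that ends in .pdf or .PDF
-\.(pdf|PDF)$

# Accept everything else
+.
```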
Is there a crawl filter that I can customize so that fetch requests for PDF URLs are ignored, or should I modify the Fetcher?
Thanks.