Posted to user@nutch.apache.org by Doğacan Güney <do...@gmail.com> on 2008/10/01 09:49:31 UTC

Re: Ignoring a url in the crawl

On Mon, Sep 29, 2008 at 9:17 PM, sangeet <sr...@gmail.com> wrote:
>
> I'm having a hard time trying to avoid crawling a particular url.
> In regex-urlfilter.txt I added the following to ignore it.
> -^http://([a-z0-9]*\.)*bhejacry.com/forums/
>
> This url is not in the list in my urls directory. I also have
> 'db.ignore.external.links' set to 'true'.
>
> However, I still see the following during the crawl
>
> fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774
> fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=96
>
> How do I ignore these urls?

Try running
bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

Then type your URL and press Enter. If the URL is filtered out, it will
be echoed back with a "-" at the beginning.

(You will need the patch from NUTCH-654, or wait a couple of hours
and I will commit it.)
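
If you cannot apply the patch yet, a rough way to sanity-check the pattern
itself is a few lines of Python. This is only a sketch, not what Nutch runs:
Nutch uses Java's regex engine, and I am assuming Python's "re" module treats
this particular pattern the same way. The check() helper and the "+" prefix
for non-matching URLs are made up for the illustration, and the second test
URL is hypothetical.

```python
import re

# Pattern copied from the regex-urlfilter.txt line in the question, without the
# leading "-" (that character only marks the rule as an exclusion).
EXCLUDE = re.compile(r"^http://([a-z0-9]*\.)*bhejacry.com/forums/")


def check(url):
    """Prefix the URL with "-" if the exclusion pattern matches, else "+".

    Hypothetical helper: it tests one pattern only and does not model the
    rest of the filter file.
    """
    # search() scans the string for a match; the "^" anchors it to the start.
    return ("-" if EXCLUDE.search(url) else "+") + url


if __name__ == "__main__":
    urls = [
        # URL taken from the crawl output in the question
        "http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774",
        # Hypothetical URL outside /forums/, for comparison
        "http://www.bhejacry.com/index.html",
    ]
    for url in urls:
        print(check(url))
```

If the forum URLs come back with a "-" here but the crawl still fetches them,
the pattern is probably fine and the cause is more likely somewhere else, for
example an earlier rule in the same file (as far as I know, the first matching
rule wins).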

-- 
Doğacan Güney