Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/15 23:19:59 UTC

Not crawling certain directories.

One more question: I'm using nutch-0.8.0 to index a domain and want to
exclude a certain directory from the crawl. In crawl-urlfilter.txt I
have defined the following:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy

However, the /yummy directory is still crawled. Any ideas as to what is 
going on? Thanks..
 Matt

Re: Not crawling certain directories.

Posted by Andrzej Bialecki <ab...@getopt.org>.

Matthew Holt wrote:
> One more question: I'm using nutch-0.8.0 to index a domain and want to
> exclude a certain directory from the crawl. In crawl-urlfilter.txt I
> have defined the following:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
> -^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
>
> However, the /yummy directory is still crawled. Any ideas as to what 
> is going on? Thanks..

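Rules are processed in order, and processing stops at the first rule
that matches. Your first rule (+) accepts every URL on the host,
including everything under /yummy, so the second rule (-) is never
reached. Just swap the two rules, so that the /yummy exclusion comes
first, and all should be ok.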
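
To illustrate the first-match-wins behaviour, here is a minimal Python sketch. It is not Nutch code: the function first_match_filter and the sample URLs are made up for the example, it assumes a pattern may match anywhere in the URL (re.search), and it assumes a URL that matches no rule is rejected. The two patterns are copied unchanged from the question.

```python
import re

def first_match_filter(rules, url):
    """Hypothetical stand-in for a first-match-wins URL filter.

    Each rule is '+regex' (accept) or '-regex' (reject). The first rule
    whose regex matches decides the outcome. A URL that matches no rule
    is rejected (an assumption of this sketch).
    """
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Patterns exactly as posted in the question.
ACCEPT = r"+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/"
REJECT = r"-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy"

original = [ACCEPT, REJECT]  # order from the question
swapped = [REJECT, ACCEPT]   # exclusion first

url = "http://wwwapps.mywebsite.com/yummy/page.html"  # made-up sample URL
print(first_match_filter(original, url))  # True: the accept rule wins, /yummy is crawled
print(first_match_filter(swapped, url))   # False: the reject rule wins, /yummy is skipped
```

With the swapped order, URLs outside /yummy fall through the reject rule and are still accepted by the second rule. As a side note, the dots in the patterns are unescaped and "com*" means "co" followed by any number of "m", so the patterns are looser than probably intended, but that is unrelated to the ordering problem.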

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com