Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/15 23:19:59 UTC
Not crawling certain directories.
One more question: I'm using nutch-0.8.0 to index a domain, and I
want to exclude a certain directory from the crawl. In
crawl-urlfilter.txt I have defined the following:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
However, the /yummy directory is still crawled. Any ideas as to what is
going on? Thanks..
Matt
Re: Not crawling certain directories.
Posted by Andrzej Bialecki <ab...@getopt.org>.
Matthew Holt wrote:
> One more question: I'm using nutch-0.8.0 to index a domain, and I
> want to exclude a certain directory from the crawl. In
> crawl-urlfilter.txt I have defined the following:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
> -^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
>
> However, the /yummy directory is still crawled. Any ideas as to what
> is going on? Thanks..
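Rules are processed in order, and processing stops at the first rule
that matches. Your first rule accepts every URL on the host, including
everything under /yummy, so the exclusion rule is never reached. Swap
the two rules so that the exclusion comes first and all should be ok.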
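
To make the ordering effect concrete, here is a rough Python sketch of
first-match-wins filtering. It is not Nutch's code: the function name
first_match_decision and the test URL are made up for illustration, and
it assumes that patterns are searched anywhere in the URL and that a URL
matching no rule is rejected.

```python
import re


def first_match_decision(rules, url):
    """Return True (accept) or False (reject) based on the first matching rule.

    Each rule is a '+' or '-' sign followed by a regex, in the style of
    crawl-urlfilter.txt. Assumption of this sketch: a URL that matches
    no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            # Stop at the first match; later rules are never consulted.
            return sign == "+"
    return False


# The two rules exactly as posted, in the original order.
original = [
    r"+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/",
    r"-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy",
]
# The same rules with the exclusion moved to the top.
swapped = list(reversed(original))

# Hypothetical URL under the directory that should be excluded.
url = "http://wwwapps.mywebsite.com/yummy/index.html"

print(first_match_decision(original, url))  # the accept rule matches first
print(first_match_decision(swapped, url))   # the reject rule matches first
```

With the original order the accept rule decides the outcome for the
/yummy URL; with the swapped order the reject rule is reached first,
while other URLs on the host still fall through to the accept rule.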
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com