Posted to user@nutch.apache.org by yl...@ifrance.com,
yl...@ifrance.com on 2007/01/12 15:16:15 UTC
problems to exclude subdirectories in a web site
Hello,
I want to exclude some subdirectories of a website from indexing,
but I have not found the right parameters.
I use Nutch-0.7.2 because I cannot index
with Nutch-0.8.1 (it crashes).
The subdirectories I want to exclude are:
/de/*
/en/*
/fr/mv/*
I tried the rule
-^http://toto.web-site.net/de/([a-z0-9]*)
and
-^http://toto.web-site.net/de/*
in my crawl-urlfilter.txt file, but neither works:
Nutch still indexes these URLs, which I do not want.
Any ideas?
I have the default regex-urlfilter.txt,
and my personal crawl-urlfilter.txt is:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/
# Website hostname for indexing
+^http://toto.web-site.net
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)
# skip everything else
-.
*********** my default regex-urlfilter.txt file is **************
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
Re: problems to exclude subdirectories in a web site
Posted by Alvaro Cabrerizo <to...@gmail.com>.
Try putting your exclude patterns before your accept patterns. If I'm not
mistaken, Nutch applies the patterns in order and the first match wins. So it
first checks +^http://toto.web-site.net, which accepts all the URLs you want
to skip with -^http://toto.web-site.net/de/([a-z0-9]*)...
Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:
...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)
# Website hostname for indexing
+^http://toto.web-site.net
# skip everything else
-.
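
To see why the order matters, here is a minimal Python sketch of the "first matching pattern wins" rule described in the filter file comments. It is not Nutch code: the names parse_rules and accepts are made up for illustration, and it assumes a rule matches when the regex is found anywhere in the URL (partial match), which is my understanding of how the regex filter behaves.

```python
import re

def parse_rules(text):
    """Turn filter-file text into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching rule decides; no match means the URL is ignored."""
    for accept, regex in rules:
        if regex.search(url):  # partial match (assumption, see above)
            return accept
    return False

# Order from the original post: the '+' host rule comes first.
ORIGINAL_ORDER = """
+^http://toto.web-site.net
-^http://toto.web-site.net/de/([a-z0-9]*)
-.
"""

# Suggested order: the '-' rule comes before the '+' host rule.
FIXED_ORDER = """
-^http://toto.web-site.net/de/([a-z0-9]*)
+^http://toto.web-site.net
-.
"""

url = "http://toto.web-site.net/de/index.html"
print(accepts(parse_rules(ORIGINAL_ORDER), url))  # True: the '+' rule matches first
print(accepts(parse_rules(FIXED_ORDER), url))     # False: the '-' rule matches first
```

With the original order the /de/ URL is accepted by the host rule before the exclude rule is ever looked at; with the exclude rule first, it is rejected, while other pages on the same host are still accepted.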
Hope it helps.