Posted to user@nutch.apache.org by yl...@ifrance.com on 2007/01/12 15:16:15 UTC

problems to exclude subdirectories in a web site

Hello,

I want to exclude some subdirectories of a website from indexing,
but I have not found the right parameters.
I am using Nutch-0.7.2 because I cannot
index with Nutch-0.8.1 (it crashes).

I want to exclude these subdirectories of my website:
/de/*
/en/*
/fr/mv/*

I tried the lines
-^http://toto.web-site.net/de/([a-z0-9]*)
and
-^http://toto.web-site.net/de/*
in my crawl-urlfilter.txt file, but
they don't work: Nutch still indexes these URLs, which I don't want.
Any idea?

I have the default regex-urlfilter.txt,
and my personal crawl-urlfilter.txt is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/

# Website hostname for indexing
+^http://toto.web-site.net

# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# skip everything else
-.
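
With this file, the accept line +^http://toto.web-site.net matches /de/ URLs before the three exclusion lines below it ever get a chance. The first-match behaviour described in the comments above can be sketched like this (a minimal plain-Python simulation, not Nutch's actual code):

```python
import re

# Rules in the same order as the crawl-urlfilter.txt above.
# The first pattern that matches decides: '+' include, '-' exclude.
rules = [
    ("-", r"^(file|ftp|mailto):"),
    ("+", r"^http://toto\.web-site\.net"),                 # matches first...
    ("-", r"^http://toto\.web-site\.net/de/([a-z0-9]*)"),  # ...so this is never reached
    ("-", r"."),                                           # skip everything else
]

def filter_url(url):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: URL is ignored

# The /de/ page is accepted even though a '-' rule exists for it:
print(filter_url("http://toto.web-site.net/de/index.html"))  # True
```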


*********** my default regex-urlfilter.txt file is **************

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
________________________________________________________________________
iFRANCE, exprimez-vous !
http://web.ifrance.com

Re: problems to exclude subdirectories in a web site

Posted by Alvaro Cabrerizo <to...@gmail.com>.
Try writing your excluding patterns before your accepting patterns. If I'm not
wrong, Nutch follows the order of the patterns, so it first checks
+^http://toto.web-site.net and accepts all the URLs you actually want
to skip with -^http://toto.web-site.net/de/([a-z0-9]*), etc.

Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:

...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# Website hostname for indexing
+^http://toto.web-site.net

# skip everything else
-.
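
With the exclusions moved first, the same minimal first-match simulation (plain Python, not Nutch's code; dots escaped for strictness, though the unescaped form also matches) now skips the three subdirectories:

```python
import re

# Rules in the corrected order: exclusions before the accept rule.
# Note ([a-z0-9]*) can match the empty string, so the prefix up to
# /de/, /en/ or /fr/mv/ is what actually does the filtering.
rules = [
    ("-", r"^http://toto\.web-site\.net/de/([a-z0-9]*)"),
    ("-", r"^http://toto\.web-site\.net/en/([a-z0-9]*)"),
    ("-", r"^http://toto\.web-site\.net/fr/mv/([a-z0-9]*)"),
    ("+", r"^http://toto\.web-site\.net"),
    ("-", r"."),
]

def filter_url(url):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(filter_url("http://toto.web-site.net/de/index.html"))  # False: excluded
print(filter_url("http://toto.web-site.net/fr/index.html"))  # True: still crawled
```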


Hope it helps.
