Posted to user@nutch.apache.org by "Vanderdray, Jake" <JV...@aarp.org> on 2005/09/20 20:20:27 UTC

Fetching FAQ

	I was reading through the FAQ and had a follow-up to one of the
questions on there.  Here's what's on the FAQ:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to fetch only pages from some specific domains?

Please have a look on PrefixURLFilter. Adding some regular expressions
to the urlfilter.regex.file might work, but adding a list with thousands
of regular expressions would slow down your system excessively.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
but don't see any corresponding file (regex-urlfilter.txt).  Am I just
missing it, or does it need to be created from scratch?  If the latter,
what is the format?  I'll update the FAQ with the answers.

Thanks,
Jake.

Re: Fetching FAQ

Posted by Gal Nitzan <gn...@usa.net>.
Vanderdray, Jake wrote:
> 	I was reading through the FAQ and had a follow-up to one of the
> questions on there.  Here's what's on the FAQ:
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Is it possible to fetch only pages from some specific domains?
>
> Please have a look on PrefixURLFilter. Adding some regular expressions
> to the urlfilter.regex.file might work, but adding a list with thousands
> of regular expressions would slow down your system excessively.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 	I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
> but don't see any corresponding file (regex-urlfilter.txt).  Am I just
> missing it, or does it need to be created from scratch?  If the latter,
> what is the format?  I'll update the FAQ with the answers.
>
> Thanks,
> Jake.
>
Hi,

You are probably missing it (or mistakenly deleted it), since it ships
as part of the tar or zip file.
Look for conf/regex-urlfilter.txt.

Gal.
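
For the PrefixURLFilter route the FAQ suggests, the file named by
urlfilter.prefix.file uses a simpler format than the regex filter: as I
understand it, each non-comment, non-blank line is a literal URL prefix,
and a URL is accepted only if it starts with one of the listed prefixes.
A sketch (the filename and the example prefixes below are assumptions,
not taken from a stock distribution):

##############################################
# prefix-urlfilter.txt (sketch - format assumed)
# One literal URL prefix per line; '#' starts a comment.
# A URL passes only if it begins with one of these prefixes.
http://lucene.apache.org/nutch/
http://www.example.org/
############################## End

Because prefix matching is a simple string comparison, this scales to
thousands of entries far better than an equivalent pile of regexes,
which is the point the FAQ answer is making.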

Here are the default file contents (so create the file if it doesn't exist):
##############################################
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
############################## End
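
The semantics described in the comments above (each rule is '+' or '-'
followed by a regular expression; the first rule whose pattern matches
decides; no match means the URL is dropped) can be sketched in a few
lines of Python. This is an illustrative re-implementation, not Nutch's
actual Java code, and it assumes unanchored matching, which is what the
explicit '^' anchors in the default rules suggest:

```python
import re

# The four default rules from regex-urlfilter.txt above, translated
# verbatim into ('+'/'-', compiled pattern) pairs.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),
    ("-", re.compile(r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip"
                     r"|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$")),
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r".")),
]

def accepts(url: str) -> bool:
    """First matching rule wins; if no rule matches, the URL is ignored."""
    for sign, pattern in RULES:
        if pattern.search(url):  # unanchored match, as assumed above
            return sign == "+"
    return False
```

To answer the original FAQ question (fetch only specific domains), the
usual approach is to put an accepting rule for your domains ahead of the
catch-all and turn the final `+.` into `-.`, e.g. a line like
`+^http://([a-z0-9]*\.)*example.org/` - adjust the pattern to your own
hosts.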