
Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2007/01/01 07:29:07 UTC

Re: how to crawl Specified type files?

You can use the prefix and suffix URL filters. Make sure the 
plugin.includes property in the nutch-*.xml file enables both urlfilter 
plugins, like so:

urlfilter-(prefix|suffix)...

Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt 
files in the conf directory. Below is a configuration that only crawls 
URLs that begin with http and ignores many different file types by 
suffix. The prefix filter accepts only URLs that start with one of the 
listed prefixes. The suffix filter starts by allowing everything and 
then specifically denies the listed file types.

Dennis

# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
# suffix-urlfilter.txt file ends here
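
As a rough illustration of how the two files above combine, here is a 
small Python sketch. It is not Nutch's actual code; the names 
load_rules and accept are made up for this example, and the suffix list 
is shortened. Handling of a "-" mode line (accept only listed suffixes) 
is an assumption, since the configuration above only uses "+I".

```python
# Illustrative sketch only; not the Nutch implementation.
PREFIX_FILE = """\
http
"""

SUFFIX_FILE = """\
# case-insensitive, allow unknown suffixes
+I
.gif
.jpg
.pdf
.txt
"""


def load_rules(text):
    """Return the non-empty, non-comment lines of a filter file."""
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]


def accept(url, prefix_text=PREFIX_FILE, suffix_text=SUFFIX_FILE):
    """Return True if the URL passes both the prefix and suffix rules."""
    # Prefix rule: the URL must start with one of the listed prefixes.
    prefixes = load_rules(prefix_text)
    if not any(url.startswith(p) for p in prefixes):
        return False

    # Suffix rule: the first rule line is the mode, the rest are suffixes.
    mode, *suffixes = load_rules(suffix_text)
    allow_unknown = mode.startswith("+")
    ignore_case = "I" in mode[1:]
    candidate = url.lower() if ignore_case else url
    if ignore_case:
        suffixes = [s.lower() for s in suffixes]
    listed = any(candidate.endswith(s) for s in suffixes)

    # "+": listed suffixes are rejected, everything else is accepted.
    # "-" (assumed): only listed suffixes are accepted.
    return (not listed) if allow_unknown else listed


if __name__ == "__main__":
    for u in ("http://example.com/index.html",
              "http://example.com/logo.GIF",
              "ftp://example.com/index.html"):
        print(u, accept(u))
```

Note that the prefix "http" also matches https URLs, because the check 
is a plain string prefix match.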

fangky@gzedu.gov.cn wrote:
> hi
>
> I want to know whether Nutch can be set to crawl only files of a specified type or with a specified name.
>
> For example: if I crawl a website that contains many document files and I want Nutch to crawl only pdf and doc files but not html files, how do I do that?
>
> Another question: can I make Nutch crawl only files with a specified name, like index.htm?
>
> thanks in advance
>
>