Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2007/01/01 07:29:07 UTC
Re: how to crawl Specified type files?
You can use the prefix and suffix URL filters. Make sure the
plugin.includes property in your nutch-*.xml file enables both urlfilter
plugins, like so:
urlfilter-(prefix|suffix)...
Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt
files in the conf directory. Below is a configuration that only crawls
URLs that begin with http and ignores many different file types by
suffix. In the prefix file, only URLs starting with a listed prefix are
accepted. In the suffix file, we start by allowing everything and then
specifically deny certain file types.
Dennis
# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here
# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
# suffix-urlfilter.txt file ends here
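As a rough illustration of how the two files above combine, here is a small Python sketch. It is not Nutch code: the function names are made up for this example, the suffix list is abbreviated, and it assumes that a URL is kept only when every enabled filter accepts it and that suffixes are matched against the end of the whole URL.

```python
# Illustrative sketch only; this is not Nutch code.
# The rule lists are abbreviated from the two config files above.
PREFIXES = ["http"]
DENIED_SUFFIXES = [".gif", ".jpg", ".zip", ".pdf", ".txt"]  # keep these lowercase


def passes_prefix(url, prefixes=PREFIXES):
    """Accept only URLs that start with one of the listed prefixes."""
    return any(url.startswith(p) for p in prefixes)


def passes_suffix(url, suffixes=DENIED_SUFFIXES, allow_unknown=True, ignore_case=True):
    """Model the '+I' line: '+' allows unlisted suffixes, 'I' ignores case."""
    candidate = url.lower() if ignore_case else url
    listed = any(candidate.endswith(s) for s in suffixes)
    # With allow_unknown=True a listed suffix is denied;
    # with allow_unknown=False only listed suffixes get through.
    return not listed if allow_unknown else listed


def accepts(url):
    """Assumption: a URL is kept only if both filters accept it."""
    return passes_prefix(url) and passes_suffix(url)


if __name__ == "__main__":
    for url in [
        "http://example.com/index.html",
        "http://example.com/logo.GIF",
        "ftp://example.com/index.html",
    ]:
        print(url, accepts(url))
```

Note that in this sketch the prefix "http" also lets https URLs through, since they start with the same characters. Setting allow_unknown=False models the opposite mode, where only the listed suffixes are accepted.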
fangky@gzedu.gov.cn wrote:
> hi
>
> I want to know whether Nutch can be set to crawl only files of specified types and files with specified names.
>
> For example, if I crawl a website that contains many document files and I want Nutch to crawl only pdf and doc files but not html files, how do I do that?
>
> Another question: can I make Nutch crawl only files with specified names, like index.htm?
>
> thanks in advance
>
>