Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/13 11:56:09 UTC

Re: [Nutch-general] index filesystem

Hi,
Check nutch-default.xml: you first need to include the file protocol plugin
(protocol-file, in plugin.includes) along with every parser you want to use.
Then check that your URL filter setup does not exclude the files you plan to
crawl.
Then you can simply crawl your file system by starting from a folder URL.
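
To check whether a filter file excludes the files you want, it can help to
replay its rules outside Nutch. The following is only a rough sketch, not
Nutch's own code: it uses Python's re module as a stand-in for Nutch's regex
engine and assumes a pattern may match anywhere in the URL. The rule semantics
(first matching pattern wins, no match means ignored) come from the comments in
the quoted crawl-urlfilter.txt. The names parse_rules and accepts are made up
for this illustration, and the wrapped suffix line has been rejoined.

```python
import re

# Rules taken from the quoted crawl-urlfilter.txt (suffix line rejoined).
RULES_TEXT = r"""
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
-[?*!@=]
+^file:///usr/Proventi2/([a-z0-9]*\.)/
+.*
"""


def parse_rules(text):
    """Turn filter text into a list of (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches nothing is ignored."""
    for include, pattern in rules:
        # Assumption: partial match, i.e. the pattern may match anywhere.
        if pattern.search(url):
            return include
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in ("file:///usr/Proventi2/",
                "file:///usr/Proventi2/report.pdf",
                "file:///usr/Proventi2/logo.gif",
                "http://example.com/"):
        print(url, "->", "crawl" if accepts(RULES, url) else "skip")
```

Under these assumptions the folder URL and report.pdf are accepted, but only
by the final "+.*" rule: the Proventi2 pattern requires a name ending in a dot
followed by a slash, so it matches neither of them.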

Stefan
On 13.12.2005 at 09:59, palombo@cli.di.unipi.it wrote:

> I need your help.
> I want to index a directory in the file system.
> What do I need to modify?
> Maybe the file crawl-urlfilter.txt?
> Only this file?
> I wrote the file "crawl-urlfilter.txt" like this:
>
> # Creative Commons crawl filter
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip http:, ftp:, & mailto: urls
> -^(http|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[*!@]
> #+[?&=%]
> -[?*!@=]
>
> # valid URLs
>
> +^file:///usr/Proventi2/([a-z0-9]*\.)/
>
> # accept anything else
> +.*
>
> Is this OK? What should I do?
> Please answer me, it is very important for me!
> Help, help!!!
>
>
>                Adriano Palombo
>