Posted to user@nutch.apache.org by jim shirreffs <jp...@verizon.net> on 2007/04/05 22:06:22 UTC

Help please trying to crawl local file system

I googled and googled and googled. I am trying to crawl my local file system 
and can't seem to get it right.

I use this command

bin/nutch crawl urls -dir crawl
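
For reference, the fuller form from the Nutch tutorial also spells out the 
crawl depth and page limit (the values here are just the tutorial's examples):

bin/nutch crawl urls -dir crawl -depth 3 -topN 50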

My urls dir contains one file (named "files") that looks like this:

file:///c:/joms

c:/joms exists

I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):

# skip everything else ..... web spaces
#-.
+.*
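
My understanding is that the one-step crawl command reads crawl-urlfilter.txt 
rather than regex-urlfilter.txt, so if the catch-all +.* is not enough, an 
explicit accept for my seed could also go here, something like:

+^file:///c:/joms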


And the config file nutch-site.xml adding

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
</configuration>
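
Those properties sit inside the file's usual wrapper, so the complete 
nutch-site.xml has this shape:

<?xml version="1.0"?>
<configuration>
  <property> ... </property>
</configuration>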


And lastly I've modified regex-urlfilter.txt
#file systems
+^file:///c:/top/directory/
-.

# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.
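
One thing I was not sure about: if the first matching rule wins in this file, 
then my +^file:///c:/top/directory/ line may need to match my actual seed 
rather than the template path, something like this (my guess, adjusted to my 
c:/joms setup):

#file systems
+^file:///c:/joms
-.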


I don't get any errors, but nothing gets crawled either. If anyone can point 
out my mistake(s), I would greatly appreciate it.

thanks in advance

jim s


PS: it would also be nice to know whether this email is getting into the 
nutch-users mailing list.





Re: Help please trying to crawl local file system

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Did you set the agent name in the Nutch configuration? I think even 
when crawling only the local file system the agent name still needs to 
be set. If it is not set, I believe nothing is fetched and errors are 
thrown, but you would only see this if your logging was set up for it.
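
For example, something like this in nutch-site.xml should do it (the value 
is just a placeholder; use any name that identifies your crawler):

<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>

After that, logs/hadoop.log should show whether anything is still being 
rejected during the fetch.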

Dennis Kubes
