Posted to user@nutch.apache.org by jim shirreffs <jp...@verizon.net> on 2007/04/05 22:06:22 UTC
Help please trying to crawl local file system
I have googled and googled and googled. I am trying to crawl my local file system
and can't seem to get it right.
I use this command
bin/nutch crawl urls -dir crawl
My urls directory contains one file (named "files") that looks like this:
file:///c:/joms
c:/joms exists
I've modified the config file crawl-urlfilter.txt:
#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else ..... web spaces
#-.
+.*
And I've modified the config file nutch-site.xml, adding:
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
</configuration>
And lastly I've modified regex-urlfilter.txt:
#file systems
+^file:///c:/top/directory/
-.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.
I don't get any errors, but nothing gets crawled either. If anyone can point
out my mistake(s), I would greatly appreciate it.
thanks in advance
jim s
PS: it would also be nice to know whether this email is getting into the
nutch-users mailing list.
Re: Help please trying to crawl local file system
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Did you set the agent name in the Nutch configuration? I think that even
when crawling only the local file system, the agent name still needs to
be set. If it is not set, I believe nothing is fetched and errors are thrown,
but you would only see them if your logging was set up for it.
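[The agent-name setting described above can be sketched as a fragment for
conf/nutch-site.xml; http.agent.name is the standard Nutch property for this,
and the value shown is only a placeholder, not anything from the original
thread:]

```xml
<!-- Sketch: goes inside the <configuration> element of conf/nutch-site.xml.
     "MyTestCrawler" is a placeholder value; any descriptive agent name works. -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```

[If the agent name is missing, the fetcher's complaint typically lands in
logs/hadoop.log rather than on the console, which would match the
"no errors but nothing crawled" symptom.]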
Dennis Kubes