Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2012/03/11 18:26:06 UTC
[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=132&rev2=133
==== How do I index my local file system? ====
The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you '''have''' to change config files to get it to crawl your local disk.
- . 1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
+ . 1) regex-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
. Change this line:
  . -^(file|ftp|mailto|https):
. to this:
  . -^(http|ftp|mailto|https):
- 2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
+ 2) regex-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
. # accept anything else
. +.*
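The filter changes in 1) and 2) depend on how the rules are evaluated: each rule is a regular expression prefixed with + (accept) or - (reject), the first rule that matches a URL decides, and a URL that matches no rule is ignored. The following is a minimal Python sketch of that behaviour, not Nutch's actual Java implementation; the names `accepts`, `DEFAULT_RULES` and `LOCAL_RULES` are hypothetical and exist only for illustration.

```python
import re


def accepts(rules, url):
    """Return True if the first rule whose pattern matches `url` is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: the URL is ignored.
    return False


# Rule set before the change: file: URLs are rejected by the first rule.
DEFAULT_RULES = ["-^(file|ftp|mailto|https):", "+.*"]

# Rule set after the change: http: URLs are rejected, file: URLs fall
# through to the catch-all accept rule at the bottom.
LOCAL_RULES = ["-^(http|ftp|mailto|https):", "+.*"]

if __name__ == "__main__":
    for url in ("file:///home/user/docs/", "http://example.com/"):
        print(url, accepts(DEFAULT_RULES, url), accepts(LOCAL_RULES, url))
```

With the changed rules a file: URL reaches the final +.* rule and is accepted, while an http: URL is rejected by the first rule, so the crawl stays on the local disk.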
3) By default the protocol-file plugin is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this: