Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2012/03/11 18:26:06 UTC

[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=132&rev2=133

  ==== How do I index my local file system? ====
  The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you '''have''' to change config files to get it to crawl your local disk.
  
-  . 1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
+  . 1) regex-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
    . Change this line: -^(file|ftp|mailto|https): to this: -^(http|ftp|mailto|https):
-  2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
+  2) regex-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
    . # accept anything else
      +.*
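
To see why the filter change in steps 1 and 2 matters, here is a minimal sketch of how such rules could be evaluated. It assumes a simplified model: rules are tried top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. The helper names (parse_rules, accepts) and the sample URLs are hypothetical and are not part of Nutch.

```python
import re

# Rule text as it would read after the edit described in step 1.
EDITED_RULES = """
# skip http:, ftp:, mailto: and https: URLs
-^(http|ftp|mailto|https):

# accept anything else
+.*
"""

# Rule text before the edit, as quoted in step 1.
ORIGINAL_RULES = EDITED_RULES.replace("(http|ftp", "(file|ftp")


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


if __name__ == "__main__":
    edited = parse_rules(EDITED_RULES)
    for url in ("file:///home/user/docs/", "http://example.com/"):
        print(url, "accepted" if accepts(edited, url) else "rejected")
```

Under this model the edited rules keep file: URLs and drop http: ones, while the original rules do the opposite, which is the "won't index anything" or "jump off your disk" behaviour described in step 1.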
   3) By default the protocol-file plugin is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this: