Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/06/22 09:59:58 UTC

[Nutch Wiki] Update of "FAQ" by JuhoMäkinen

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JuhoMäkinen:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
Added Q/A How can I force fetcher to use custom nutch-default.xml and/or nutch-s

------------------------------------------------------------------------------
    * Use -numFetchers to generate multiple small segments.
    * Now you can generate new segments. You may want to use -adddays so that bin/nutch generate puts all the URLs into the new fetchlist again. Add more than 7 days if you did not run updatedb.
    * Or send the process a Unix STOP signal. You should then be able to index the part of the segment that has already been fetched. Later, send a CONT signal to the process to resume it. Do not turn off your computer in between! :)
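
The STOP/CONT approach can be sketched in shell as follows. This is a minimal illustration, not a Nutch command: a background sleep stands in for the running fetcher, and the variable name FETCHER_PID is made up for the example. In practice you would look up the PID of the real fetch process (for example with ps).

```shell
# Stand-in for a long-running fetcher process (assumption for this sketch).
sleep 30 &
FETCHER_PID=$!

# Suspend the process: it stays in memory but gets no CPU time.
kill -STOP "$FETCHER_PID"

# ... index the already-fetched part of the segment here ...

# Resume the process where it left off.
kill -CONT "$FETCHER_PID"
```

The suspended process lives only in memory, which is why the machine must stay on between the two signals.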
+ 
+ '''How can I force the fetcher to use a custom nutch-config?'''
+   * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
+   * Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
+   * Modify nutch-default.xml in the new directory to suit your needs
+   * Set the NUTCH_CONF_DIR environment variable to point to the directory you created
+   * Run $NUTCH_HOME/bin/nutch with NUTCH_CONF_DIR set in its environment. Check the command output for the lines that report which config files are loaded, and confirm that they really come from your custom directory.
+   * Happy using.
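
The copy step above could be scripted roughly like this. It is only a sketch: the helper name copy_nutch_conf is made up for illustration and is not part of Nutch, and it assumes every file listed above exists in the source directory.

```shell
# copy_nutch_conf SRC DEST: copy the config files listed above from SRC to DEST.
# Hypothetical helper for this example; not shipped with Nutch.
copy_nutch_conf() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # mime-types.* is left unquoted on purpose so the shell expands the glob.
    cp "$src"/common-terms.utf8 "$src"/mime-types.* "$src"/nutch-conf.xsl \
       "$src"/nutch-default.xml "$src"/regex-normalize.xml \
       "$src"/regex-urlfilter.txt "$dest"/
}

# Typical use, assuming NUTCH_HOME points at your Nutch installation:
#   copy_nutch_conf "$NUTCH_HOME/conf" "$NUTCH_HOME/conf/myconfig"
#   (edit conf/myconfig/nutch-default.xml)
#   export NUTCH_CONF_DIR="$NUTCH_HOME/conf/myconfig"
#   "$NUTCH_HOME/bin/nutch"
```

Using export matters here: bin/nutch is started as a child process and only sees NUTCH_CONF_DIR if the variable is in the exported environment.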
  
  
  ----