Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/02/21 08:16:37 UTC

[Nutch Wiki] Update of "NutchHadoopTutorial" by ChiaHungLin

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopTutorial" page has been changed by ChiaHungLin.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=26&rev2=27

--------------------------------------------------

  scp -r /nutch/search/* nutch@computer:/nutch/search
  }}}
  
+ '''The main point is to copy the nutch-* files (under $nutch_home/conf) and crawl-urlfilter.txt to the $hadoop_home/conf folder so that the Hadoop cluster picks up that configuration at startup (see the example commands below). Otherwise the Hadoop cluster will complain with messages such as "0 records selected for fetching, exiting .. URLs to fetch - check your seed list and URL filters."'''
+ 
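+ For example, a minimal sketch of that copy step, using the $nutch_home and $hadoop_home paths mentioned above (adjust them to your own layout):
+ 
+ {{{
+ # copy the Nutch configuration files and the URL filter into Hadoop's conf directory
+ cp $nutch_home/conf/nutch-* $hadoop_home/conf/
+ cp $nutch_home/conf/crawl-urlfilter.txt $hadoop_home/conf/
+ }}}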
  Do this for every computer you want to use as a slave node.  Then edit the slaves file, adding each slave node name to the file, one per line.  You will also want to edit the hadoop-site.xml file and change the values for the map and reduce task numbers, making them a multiple of the number of machines you have.  For our system, which has 6 data nodes, I put in 32 as the number of tasks.  The replication property can also be changed at this time.  A good starting value is something like 2 or 3. *(see the note at the bottom about possibly having to clear the filesystem of new datanodes).   Once this is done you should be able to start up all of the nodes.
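  As a rough sketch, the corresponding hadoop-site.xml entries for the 6-node example above could look like the following (property names as used by the Hadoop versions this tutorial was written against; adjust the values for your own cluster):
  
  {{{
  <property>
    <name>mapred.map.tasks</name>
    <value>32</value>
    <!-- roughly a multiple of the number of slave machines -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>32</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <!-- 2 or 3 is a good starting value -->
  </property>
  }}}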
  
  To start all of the nodes we use the exact same command as before: