Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/03/31 20:31:33 UTC

[Nutch Wiki] Update of "NutchHadoopTutorial" by DennisKubes

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Added rsyncing code to slave nodes and removed unnecessary env variables

------------------------------------------------------------------------------
  the hadoop-env.sh file:
  
  {{{
- NUTCH_HOME=/nutch/search
- HADOOP_HOME=/nutch/search
+ export HADOOP_HOME=/nutch/search
- 
- JAVA_HOME=/usr/java/jdk1.5.0_06
+ export JAVA_HOME=/usr/java/jdk1.5.0_06
- NUTCH_JAVA_HOME=${JAVA_HOME}
- 
- NUTCH_LOG_DIR=${HADOOP_HOME}/logs
+ export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- 
- NUTCH_MASTER=devcluster01
- HADOOP_MASTER=devcluster01
- HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
+ export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
  }}}
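  
  To sanity-check these settings (a quick sketch, assuming you run it from /nutch/search), you can source the file and verify the values:
  
  {{{
  . conf/hadoop-env.sh
  echo "HADOOP_HOME=$HADOOP_HOME"
  echo "JAVA_HOME=$JAVA_HOME"
  $JAVA_HOME/bin/java -version
  }}}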
  
  There are other variables in this file which will affect the behavior of Hadoop.  If, when you start running the scripts later, you get ssh errors, try changing the HADOOP_SSH_OPTS variable.  Note also that, after the initial copy, you can set HADOOP_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave.
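  
  For example, if your ssh does not accept Hadoop's defaults, something like the following in conf/hadoop-env.sh may help (a sketch; ConnectTimeout and BatchMode are standard OpenSSH options, but tune them for your environment):
  
  {{{
  export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o BatchMode=yes"
  }}}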
@@ -418, +411 @@

  bin/start-all.sh
  }}}
  
- A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting to call the start-all.sh script.
+ '''A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting to call the start-all.sh script.'''
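+ 
+ For example, run it from the master; each slave should report back (a sketch, hostnames and uptimes will of course differ):
+ 
+ {{{
+ bin/slaves.sh uptime
+ devcluster02:  20:10:15 up 40 days,  1:17,  0 users,  load average: 0.02, 0.03, 0.00
+ devcluster03:  20:10:15 up 40 days,  1:20,  0 users,  load average: 0.00, 0.01, 0.00
+ }}}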
  
  The first time all of the nodes are started, there may be an ssh prompt asking to add the hosts to the known_hosts file.  You will have to type yes for each one and hit enter.  The output may be a little weird the first time, but just keep typing yes and hitting enter if the prompts keep appearing.  You should see output showing all the servers starting on the local machine and the task tracker and data node servers starting on the slave nodes.  Once this is complete we are ready to begin our crawl.
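  
  If you would rather avoid the interactive prompts entirely, one option (a sketch, assuming ssh-keyscan is available on the master) is to pre-seed the known_hosts file for each slave before starting:
  
  {{{
  ssh-keyscan -t rsa devcluster02 >> /nutch/home/.ssh/known_hosts
  }}}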
  
@@ -497, +490 @@

  Then point your browser to http://devcluster01:8080 (your master node) to see the Nutch search web application.  If everything has been configured correctly, you should be able to enter queries and get results.
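  
  From the command line you can also verify that the webapp is answering (a sketch, assuming curl is installed; this should print an HTTP status code such as 200):
  
  {{{
  curl -s -o /dev/null -w "%{http_code}\n" http://devcluster01:8080/
  }}}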
  
  
+ == Rsyncing Code to Slaves ==
+ --------------------------------------------------------------------------------
+ Nutch and Hadoop provide the ability to rsync master changes to the slave nodes.  This is optional, though, because it slows down the startup of the servers and because you might not want changes automatically synced to the slave nodes.
+ 
+ If you do want this capability enabled, below I will show you how to configure your servers to rsync from the master.  There are a few things you should know first.
+ 
+ First, even though the slave nodes can rsync from the master, you still have to copy the base installation over to each slave node the first time so that the scripts are available to do the rsync.  This is the way we did it above, so that shouldn't require any changes.
+ 
+ Second, the way the rsync happens is that the master node does an ssh into the slave node and calls bin/hadoop-daemon.sh.  The script on the slave node then calls rsync back to the master node.  This means you must have a password-less login from each of the slave nodes to the master node.  Earlier we set up password-less login from the master to the slaves; now we need to do the reverse.
+ 
+ Third, if you have problems with the rsync options (I did, and I had to change them because I am running an older version of ssh), look in the bin/hadoop-daemon.sh script around line 82 for where it calls the rsync command.
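+ 
+ For reference, the relevant part of bin/hadoop-daemon.sh looks roughly like the following (a sketch from the version I was using; yours may differ, which is why the line number above is approximate):
+ 
+ {{{
+ if [ "$HADOOP_MASTER" != "" ]; then
+   echo rsync from $HADOOP_MASTER
+   rsync -a -e ssh --delete --exclude=.svn $HADOOP_MASTER/ "$HADOOP_HOME"
+ fi
+ }}}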
+ 
+ So the first thing we need to do is set up the HADOOP_MASTER variable in the conf/hadoop-env.sh file.  The variable will need to look like this:
+ 
+ {{{
+ export HADOOP_MASTER=devcluster01:/nutch/search
+ }}}
+ 
+ This will need to be copied to all of the slave nodes like this:
+ 
+ {{{
+ scp /nutch/search/conf/hadoop-env.sh nutch@devcluster02:/nutch/search/conf/hadoop-env.sh
+ }}}
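+ 
+ If you have more than a couple of slaves, a small loop saves typing (a sketch, assuming the slaves file lists one hostname per line and comments start with #):
+ 
+ {{{
+ for slave in $(grep -v '^#' /nutch/search/conf/slaves); do
+   scp /nutch/search/conf/hadoop-env.sh nutch@${slave}:/nutch/search/conf/hadoop-env.sh
+ done
+ }}}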
+ 
+ And finally, you will need to log into each of the slave nodes, create a default ssh key on each machine, and then copy it back to the master node, where you will append it to the /nutch/home/.ssh/authorized_keys file.  Here are the commands for each slave node; be sure to change the slave node name in the .pub filename when you copy the key file back to the master node so you don't overwrite files:
+ 
+ {{{
+ ssh -l nutch devcluster02
+ cd /nutch/home/.ssh
+ 
+ ssh-keygen -t rsa (Use empty responses for each prompt)
+   Enter passphrase (empty for no passphrase): 
+   Enter same passphrase again: 
+   Your identification has been saved in /nutch/home/.ssh/id_rsa.
+   Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
+   The key fingerprint is:
+   a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost
+ 
+ scp id_rsa.pub nutch@devcluster01:/nutch/home/devcluster02.pub
+ }}}
+ 
+ Once you have done that for each of the slave nodes, you can append the key files to the authorized_keys file on the master node:
+ 
+ {{{
+ cd /nutch/home
+ cat devcluster*.pub >> .ssh/authorized_keys
+ }}}
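+ 
+ Before relying on the reverse login, it is worth checking permissions and testing it; sshd typically refuses keys when the .ssh directory or authorized_keys file is group- or world-writable (the commands below are a sketch using the paths from above):
+ 
+ {{{
+ # on the master node, tighten permissions
+ chmod 700 /nutch/home/.ssh
+ chmod 600 /nutch/home/.ssh/authorized_keys
+ 
+ # then from each slave node, this should print "connected" without asking for a password
+ ssh -l nutch devcluster01 echo connected
+ }}}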
+ 
+ With this setup, whenever you run the bin/start-all.sh script the code should be synced from the master node to each of the slave nodes.
+ 
+ 
  == Conclusion ==
  --------------------------------------------------------------------------------
  I know this has been a lengthy tutorial, but hopefully it has gotten you familiar with both Nutch and Hadoop.  Both Nutch and Hadoop are complicated applications and, as you have learned, setting them up is not necessarily an easy task.  I hope that this document has helped to make it easier for you.