Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2012/11/29 02:50:04 UTC

[Hadoop Wiki] Update of "QuickStart" by GlenMazza

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "QuickStart" page has been changed by GlenMazza:
http://wiki.apache.org/hadoop/QuickStart?action=diff&rev1=34&rev2=35

Comment:
Removed duplicate information already available on the Hadoop Site, providing links instead to that information (remaining Apache information should eventually be incorporated into the website.)

   * [[http://www.cloudera.com/hadoop-deb|Debian Packages for Debian based systems]] (Debian, Ubuntu, etc)
   * [[http://www.cloudera.com/hadoop-ec2|AMI for Amazon EC2]]
  
- If you want to work exclusively with Hadoop code directly from Apache, the rest of this document can help you get started quickly from there.
+ If you want to work exclusively with Hadoop code directly from Apache, the following articles from the website will be most useful:
+  * [[http://hadoop.apache.org/docs/stable/single_node_setup.html|Single-Node Setup]]
+  * [[http://hadoop.apache.org/docs/stable/cluster_setup.html|Cluster Setup]]
  
+ A note for the above Apache links: if you're having trouble getting "ssh localhost" to work on the following OS's, these tips may help:
- The instructions below are based on the docs found at the [[http://hadoop.apache.org/common/docs/current/cluster_setup.html#Configurationml | Hadoop Cluster Setup/Configuration]] page.
- 
- Please note the instructions were last updated to match Release 0.21.0. Things may have changed since then. If they have, please update this page.
- 
- == Requirements ==
-  * Java 1.6+ (see HadoopJavaVersions for 1.6.X version details)
-  * ssh and sshd
-  * rsync
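- 
- A quick way to sanity-check these prerequisites (just a convenience sketch; it confirms the client tools are installed but not that sshd is actually running):
- {{{
- java -version    # should report 1.6.x or later
- ssh -V           # prints the installed OpenSSH version
- rsync --version
- }}}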
- 
- == Preparatory Steps ==
- Download
- 
- '''Release Versions:''' official releases can be found at http://hadoop.apache.org/core/releases.html
- 
- '''Subversion:'''
- First check that the current build isn't broken:
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/
- 
- Then grab the latest with subversion:
- {{{svn co http://svn.apache.org/repos/asf/hadoop/core/trunk hadoop}}}
- 
- 
- Run the following commands:
- {{{
- cd hadoop
- ant 
- ant examples
- bin/hadoop
- }}}
- `bin/hadoop` should display the basic command line help docs and let you know it's at least basically working. If any of the above steps failed, use subversion to roll back to an earlier day's revision.
- 
- == Stage 1: Standalone Operation ==
- By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
- {{{
- mkdir input
- cp conf/*.xml input
- bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 'security[a-z.]+'
- cat output/*
- }}}
- 
- Obviously the version number on the jar may have changed by the time you read this. You should see a lot of INFO-level log messages go by when you run it, and `cat output/*` should give you something that looks like this:
- 
- {{{
- cat output/*
- 1	security.task.umbilical.protocol.acl
- 1	security.refresh.policy.protocol.acl
- 1	security.namenode.protocol.acl
- 1	security.job.submission.protocol.acl
- 1	security.inter.tracker.protocol.acl
- 1	security.inter.datanode.protocol.acl
- 1	security.datanode.protocol.acl
- ...(and so on)
- }}}
- 
- If you saw the error `Exception in thread "main" java.lang.NoClassDefFoundError: hadoop-mapred-examples-0/21/0/jar`, it means you forgot to type `jar` after `bin/hadoop`. If you were unable to run this example, roll back to a previous night's version. If it seemed to run fine but `cat` didn't spit anything out, you probably mistyped something; try copying the command directly from the wiki to avoid typos. You'll need to wipe out the output directory between each run.
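- 
- For example, to clear the previous run's output before re-running (assuming you are still in the hadoop directory from above):
- {{{
- rm -rf output
- }}}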
- 
- Congratulations, you have just successfully run your first MapReduce job with Hadoop.
- 
- == Stage 2: Pseudo-distributed Configuration ==
- You can in fact run everything on a single host. To run things this way, put the following properties into your configuration (a single `conf/hadoop-site.xml` in versions < 0.20; split across `conf/core-site.xml`, `conf/mapred-site.xml`, and `conf/hdfs-site.xml` in 0.20+, as sketched after this block):
- {{{
- <configuration>
- 
-   <property>
-     <name>fs.default.name</name>
-     <value>localhost:9000</value>
-   </property>
- 
-   <property>
-     <name>mapred.job.tracker</name>
-     <value>localhost:9001</value>
-   </property>
- 
-   <property>
-     <name>dfs.replication</name>
-     <value>1</value>
- 	<!-- set to 1 to reduce warnings when 
- 	running on a single node -->
-   </property>
- 
- </configuration>
- }}}
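- 
- For 0.20+ releases, here is a rough sketch of the same three properties split across the per-subsystem files (the values are just the localhost examples above, with the hdfs:// form used in the fully-distributed section below):
- {{{
- <!-- conf/core-site.xml -->
- <configuration>
-   <property>
-     <name>fs.default.name</name>
-     <value>hdfs://localhost:9000</value>
-   </property>
- </configuration>
- 
- <!-- conf/mapred-site.xml -->
- <configuration>
-   <property>
-     <name>mapred.job.tracker</name>
-     <value>localhost:9001</value>
-   </property>
- </configuration>
- 
- <!-- conf/hdfs-site.xml -->
- <configuration>
-   <property>
-     <name>dfs.replication</name>
-     <value>1</value>
-   </property>
- </configuration>
- }}}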
- 
- Now check that the command `ssh localhost` does not require a password. If it does, set up passwordless ssh. For example, you can execute the following commands:
- {{{
- ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- }}}
- 
- Now, try `ssh localhost` again. If this doesn't work, you're going to have to figure out what's going on with your `ssh-agent` on your own.
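- 
- Before digging into `ssh-agent`, one common culprit worth ruling out (a suggestion not in the original steps above) is overly permissive permissions on your ssh files, which sshd rejects by default:
- {{{
- chmod 700 ~/.ssh
- chmod 600 ~/.ssh/authorized_keys
- }}}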
  
  '''Windows Users''' To start the ssh server, you need to run "ssh-host-config -y" in a Cygwin environment. If it asks for the CYGWIN environment variable value, set it to "ntsec tty". Afterwards you can start the server from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".
  
  '''Mac Users''' In recent versions of OS X, ssh-agent is already set up with launchd and Keychain. This can be verified by executing "echo $SSH_AUTH_SOCK" in your favorite shell. You can use ssh-add with the -k and -K options to add your keys and passphrases to your keychain.
  
+ Multi-node cluster setup is largely similar to single-node (pseudo-distributed) setup, except for the following:
- === Bootstrapping ===
- A new distributed filesystem must be formatted with the following command, run on the master node:
- 
- {{{bin/hadoop namenode -format}}}
- 
- If asked whether to [re]format, you must reply with a capital Y (not just y) if you want to reformat; otherwise Hadoop will abort the format.
- 
- You should see a quick series of `STARTUP_MSG`s and a `SHUTDOWN_MSG`
- 
- Open the {{{conf/hadoop-env.sh}}} file and define {{{JAVA_HOME}}} in it.
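- For example (the JDK path below is only a placeholder; point it at wherever your Java 1.6+ installation actually lives):
- {{{
- # in conf/hadoop-env.sh
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- }}}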
- Then start up the Hadoop daemon with 
- 
- {{{bin/start-all.sh}}}
- 
- It should notify you that it's starting the `namenode`, `datanode`, `secondarynamenode`, `jobtracker`, and `tasktracker`.
- 
- Input files are copied into the distributed filesystem as follows: 
- {{{bin/hadoop dfs -put <localsrc> <dst>}}}
- For more details just type `bin/hadoop dfs` with no options.
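- As a minimal sketch (reusing the local `input` directory created in Stage 1 above), copy it into the distributed filesystem and list it back:
- {{{
- bin/hadoop dfs -put input input
- bin/hadoop dfs -ls input
- }}}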
- 
- To shutdown:
- 
- {{{bin/stop-all.sh}}}
- 
- === Browsing to the Services ===
- 
- Once the pseudo-distributed cluster is live, you can point your web browser at it by connecting to localhost at the chosen ports. If you have left the values at their defaults, the page PseudoDistributedHadoop provides shortcuts to these pages.
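- 
- For reference, with the stock defaults in the 0.20/0.21 line (adjust if you have changed any ports), the two main web interfaces are:
- {{{
- http://localhost:50070/   # NameNode (HDFS) status
- http://localhost:50030/   # JobTracker (MapReduce) status
- }}}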
- 
- == Stage 3: Fully-distributed operation ==
- 
- Fully distributed operation is just like the pseudo-distributed operation described above, except that you specify:
  
   1. The hostname or IP address of your master server in the value for fs.default.name, as hdfs://master.example.com/ in conf/core-site.xml.
   1. The host and port of your master server in the value of mapred.job.tracker, as master.example.com:port, in conf/mapred-site.xml (a sketch of these two files follows this list).
@@ -153, +31 @@

   1. mapred.map.tasks and mapred.reduce.tasks in conf/mapred-site.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.
   1. Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
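  
  As a rough sketch of the first two items (master.example.com is only a placeholder, and port 9001 is just borrowed from the pseudo-distributed example above), every node would carry something like:
  {{{
  <!-- conf/core-site.xml -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master.example.com/</value>
    </property>
  </configuration>
  
  <!-- conf/mapred-site.xml -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>
    </property>
  </configuration>
  }}}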
  
- See [[http://hadoop.apache.org/common/docs/current/cluster_setup.html#Configurationml | Hadoop Cluster Setup/Configuration]] for details.
+ See [[http://hadoop.apache.org/common/docs/stable/cluster_setup.html#Configuration | Hadoop Cluster Setup/Configuration]] for details.