Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/05/05 19:06:06 UTC
[Hadoop Wiki] Update of "AmazonEC2" by ChrisWensel

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ChrisWensel:
http://wiki.apache.org/hadoop/AmazonEC2

The comment on the change is:
Update for changes on contrib/ec2 for version 0.17

------------------------------------------------------------------------------
  This document assumes that you have already followed the steps in [http://docs.amazonwebservices.com/AmazonEC2/gsg/2007-01-03/ Amazon's Getting Started Guide]. In particular, you should have run through the sections "Setting up an Account", "Setting up the Tools" and the "Generating a Keypair" section of "Running an Instance".
  
  Note that the older, manual step-by-step guide to getting Hadoop running on EC2 can be found [http://wiki.apache.org/lucene-hadoop/AmazonEC2?action=recall&rev=10 here].
+ 
+ '''Version 0.17''' of Hadoop includes a few changes that support multiple simultaneous clusters, speed up startup for large clusters, and add a pre-configured Ganglia installation. These differences are noted below.
  
  == Preliminaries ==
  
@@ -27, +29 @@

  
  Clusters of Hadoop instances are created in a security group. Instances within the group have unfettered access to one another. Machines outside the group (such as your workstation) can only access instances on port 22 (for SSH), port 50030 (for the JobTracker's web interface, permitting one to view job status), and port 50060 (for the TaskTracker's web interface, for more detailed debugging).
  
- Hadoop requires slave nodes to be able to establish SSH connections to the master node (and vice versa). This is achieved after the cluster has launched by copying the EC2 private key to all machines in the cluster.
+ ('''Pre Hadoop 0.17''') These EC2 scripts require slave nodes to be able to establish SSH connections to the master node (and vice versa). This is achieved after the cluster has launched by copying the EC2 private key to all machines in the cluster.
  
  == Setting up ==
   * Unpack the latest Hadoop distribution on your system (version 0.12.0 or later).
@@ -43, +45 @@

  % ec2-describe-images -x all | grep hadoop
  }}}
       * The default value for `S3_BUCKET` (`hadoop-ec2-images`) is for public images. You normally only need to change this if you want to use a private image you have built yourself.      
-    * Hadoop cluster variables (`GROUP`, `MASTER_HOST`, `NO_INSTANCES`)
+    * ('''Pre 0.17''') Hadoop cluster variables (`GROUP`, `MASTER_HOST`, `NO_INSTANCES`)
       * `GROUP` specifies the private group to run the cluster in. Typically the default value is fine.
       * `MASTER_HOST` is the hostname of the master node in the cluster. You need to set this to be a hostname that you have DNS control over - it needs resetting every time a cluster is launched. Services such as [http://www.dyndns.com/services/dns/dyndns/ DynDNS] and [http://developer.amazonwebservices.com/connect/thread.jspa?messageID=61609#61609 the like] make this fairly easy.
       * `NO_INSTANCES` sets the number of instances in your cluster. You need to set this. Currently Amazon limits the number of concurrent instances to 20.
  
- == Running a job on a cluster ==
+ == Running a job on a cluster (Pre 0.17) ==
   * Open a command prompt in ''src/contrib/ec2''.
   * Launch an EC2 cluster and start Hadoop with the following command. During execution of this script you will be prompted to set up DNS. {{{
  % bin/hadoop-ec2 run
@@ -70, +72 @@

  % bin/hadoop-ec2 terminate
  }}}
  
+ == Running a job on a cluster (0.17) ==
+  * Open a command prompt in ''src/contrib/ec2''.
+  * Launch an EC2 cluster and start Hadoop with the following command. You must supply a cluster name (`test-cluster` in this example) and the number of slaves (`2`). After the cluster boots, the public DNS name of the master will be printed to the console. {{{
+ % bin/hadoop-ec2 launch-cluster test-cluster 2
+ }}}
+  * You can log in to the master node from your workstation by typing: {{{
+ % bin/hadoop-ec2 login test-cluster
+ }}}
+  * You will then be logged into the master node where you can start your job.
+    * For example, to test your cluster, try {{{
+ # cd /usr/local/hadoop-*
+ # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
+ }}}
+  * You can check the progress of your job at `http://<MASTER_HOST>:50030/`, where `MASTER_HOST` is the host name printed after the cluster started, above.
+  * When you have finished, shut down the cluster with the following: {{{
+ % bin/hadoop-ec2 terminate-cluster test-cluster
+ }}}
+  * Keep in mind that the master node is started and configured first, then all slave nodes are booted simultaneously with boot parameters pointing to the master node. Even though the `launch-cluster` command has returned, the whole cluster may not yet have booted. You should monitor the cluster via port 50030 to make sure all nodes are up.
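
The monitoring step in the last point can be partly automated. The following is a minimal sketch and is not part of the contrib/ec2 scripts: it polls the JobTracker web interface on port 50030 until it answers. The function names `jobtracker_url` and `wait_for_jobtracker`, the default retry count, and the default delay are illustrative assumptions, and `curl` is assumed to be installed on your workstation. It only confirms that the master's web interface is reachable; you still need to look at the page to see how many nodes have joined.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the hadoop-ec2 scripts.

# Build the JobTracker status URL for a given master host name.
jobtracker_url() {
  echo "http://$1:50030/"
}

# Poll the JobTracker web interface until it responds or we give up.
# $1 = master host name
# $2 = maximum attempts (assumed default: 30)
# $3 = seconds between attempts (assumed default: 10)
wait_for_jobtracker() {
  host=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: fail on HTTP errors, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$(jobtracker_url "$host")"; then
      echo "JobTracker is up at $(jobtracker_url "$host")"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "JobTracker did not respond after $attempts attempts" >&2
  return 1
}
```

Call it as `wait_for_jobtracker <MASTER_HOST>`, using the host name printed by `launch-cluster`.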
+ 
- == Troubleshooting ==
+ == Troubleshooting (Pre 0.17) ==
  Running Hadoop on EC2 involves a high level of configuration, so it can take a few goes to get the system working for your particular setup.
  
  If you are having problems with the Hadoop EC2 `run` command then you can run the following in turn, which have the same effect but may help you to see where the problem is occurring: {{{
@@ -81, +102 @@

  Currently, the scripts don't have much in the way of error detection or handling. If a script produces an error, then you may need to use the Amazon EC2 tools for interacting with instances directly - for example, to shut down an instance that is misconfigured.
  
  Another technique for debugging is to manually run the scripts line-by-line until the error occurs. If you have feedback or suggestions, or need help, then please use the Hadoop mailing lists.
+ 
+ == Troubleshooting (0.17) ==
+ Running Hadoop on EC2 involves a high level of configuration, so it can take a few goes to get the system working for your particular setup.
+ 
+ If you are having problems with the Hadoop EC2 `launch-cluster` command then you can run the following in turn, which have the same effect but may help you to see where the problem is occurring: {{{
+ % bin/hadoop-ec2 launch-master <cluster-name>
+ % bin/hadoop-ec2 launch-slaves <cluster-name> <num slaves>
+ }}}
+ 
+ Note that you can call the `launch-slaves` command as many times as necessary to grow your cluster. Shrinking a cluster is trickier and should be done by hand (after balancing file replications, etc.).
+ 
+ To browse all your nodes via a web browser, starting at the 50030 status page, start the following command in a new shell window: {{{
+ % bin/hadoop-ec2 proxy <cluster-name>
+ }}}
+ 
+ This command will start a SOCKS tunnel through your master node, and print out all the URLs you can reach from your web browser. For this to work, you must configure your browser to send requests over SOCKS to the local proxy on port 6666. The Firefox plugin FoxyProxy is great for this.
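
As a quick check without reconfiguring a browser, a single request can be sent through the same tunnel from the command line. This is a minimal sketch, assuming `curl` is installed, the `proxy` command is running in another shell, and the tunnel speaks SOCKS5 (as OpenSSH dynamic forwarding does). The helper name `proxy_fetch_cmd` is hypothetical; port 6666 is the local proxy port mentioned above. The helper only prints the command, so you can inspect it before running it.

```shell
#!/bin/sh
# Hypothetical helper; not part of the hadoop-ec2 scripts.
# Print a curl command that fetches a URL through the local SOCKS proxy.
# --socks5-hostname asks the proxy side to resolve host names, so names
# that are only known inside EC2 can still be looked up.
# $1 = URL to fetch
# $2 = local SOCKS port (default 6666, the port named in the text above)
proxy_fetch_cmd() {
  echo "curl --socks5-hostname localhost:${2:-6666} $1"
}

# Example: show the command for the JobTracker status page.
proxy_fetch_cmd "http://<MASTER_HOST>:50030/"
```

Replace `<MASTER_HOST>` with the host name printed at launch, then run the printed command.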
+ 
+ Currently, the scripts don't have much in the way of error detection or handling. If a script produces an error, then you may need to use the Amazon EC2 tools for interacting with instances directly - for example, to shut down an instance that is misconfigured.
+ 
+ Another technique for debugging is to manually run the scripts line-by-line until the error occurs. If you have feedback or suggestions, or need help, then please use the Hadoop mailing lists.
+ 
+ If you find that not all of your nodes are showing up, you can point your browser to the Ganglia status page for your cluster at `http://<MASTER_HOST>/ganglia/`, after starting the `proxy` command.
  
  == Building your own Hadoop image ==
  The public images should be sufficient for most needs; however, there are circumstances where you would like to build your own images, perhaps because an image with the version of Hadoop you want isn't available (an older version, the latest trunk version, or a patched version), or because you want to run extra software on your instances.
@@ -103, +146 @@

     * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
       * When creating an AMI, `HADOOP_VERSION` is used to select which version of Hadoop to download and install from http://www.apache.org/dist/lucene/hadoop/.
       * Change `S3_BUCKET` to be a bucket you own that you want to store the Hadoop AMI in.
+    * ('''0.17''') AMI size selection (`INSTANCE_TYPE`)
+      * When creating an AMI, `INSTANCE_TYPE` denotes the instance size the image will be run on (small, large, or xlarge). Ultimately this decides if the image is `i386` or `x86_64`, so this value is also used on cluster startup.
     * Java variables
       * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the [http://java.sun.com/javase/downloads/index.jsp Sun Java downloads page], select a recent stable JDK, and get the URL for the JDK (not JRE) labelled "Linux self-extracting file".
       * `JAVA_VERSION` is the version number of the JDK to be installed.