Posted to user@hadoop.apache.org by Mohammad Tariq <do...@gmail.com> on 2012/11/03 09:58:58 UTC

Re: Hadoop - cluster set-up (for DUMMIES)... or how I did it

Hello Andy,

        Thank you for sharing your experience with us. I would just like
to add that it is always good to include the "dfs.name.dir" and "dfs.data.dir"
properties in the hdfs-site.xml file to make sure that everything runs
smoothly, because /tmp gets emptied at each restart, so there is always a
chance of losing the data and meta info. It is also good to set
"hadoop.tmp.dir" in core-site.xml, as it also defaults to /tmp.
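
For example, a minimal sketch (the paths here are just placeholders; pick
directories outside /tmp that are owned and writable by the hdfs user):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn</value>   <!-- NameNode metadata -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn</value>   <!-- DataNode blocks -->
</property>

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/1/hadoop-tmp</value>
</property>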

Regards,
    Mohammad Tariq



On Fri, Nov 2, 2012 at 10:05 PM, Kartashov, Andy <An...@mpac.ca> wrote:

> Hello Hadoopers,
>
> After weeks of struggle and numerous rounds of error debugging, I finally
> managed to set up a fully distributed cluster. I decided to share my
> experience with the newcomers.
> In case the experts on here disagree with some of the facts mentioned
> herein, feel free to correct them or add your comments.
>
> Example Cluster Topology:
> Node 1 – NameNode+JobTracker
> Node 2 – SecondaryNameNode
> Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N
>
> Configuration set-up after you installed Hadoop:
>
> Firstly, you will need to find the host address of each of your Nodes by
> running:
> $ hostname -f
>
> Your /etc/hadoop/ folder contains subfolders with your configuration files.
> Your installation will create a default folder, conf.empty. Copy it to, say,
> conf.cluster, and make sure your conf soft link points to conf.cluster.
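>
> A minimal sketch of that copy step (the /etc/hadoop path matches the packaged
> layout described above; adjust if your install differs):
> $ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.cluster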
>
> You can see what it currently points to by running:
> $ alternatives --display hadoop-conf
>
> Make a new link and set it to point to conf.cluster:
> $ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf
> /etc/hadoop/conf.cluster 50
> $ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
> Run the display command again to verify the configuration:
> $ alternatives --display hadoop-conf
>
> Let’s go inside conf.cluster
> $ cd conf.cluster/
>
> As a minimum, we will need to modify the following files:
> 1.      core-site.xml
> <property>
>   <name>fs.defaultFS</name>
>     <value>hdfs://<host-name>:8020/</value> # the host-name of your
> NameNode (Node1), which you found with “hostname -f” above
>   </property>
>
> 2.      mapred-site.xml
>   <property>
>     <name>mapred.job.tracker</name>
>     <!--<value><host-name>:8021</value> --> # the host-name of your
> NameNode (Node 1) as well, since we intend to run the NameNode and JobTracker
> on the same machine
>     <value>hdfs://ip-10-62-62-235.ec2.internal:8021</value>
>   </property>
>
> 3.      masters # if this file doesn’t exist yet, create it and add one
> line:
> <host-name> # it is the host-name of your Node2 – running SecondaryNameNode
>
> 4.      slaves # if this file doesn’t exist yet, create it and add your
> host-names (one per line):
> <host-name> # it is the host-name of your Node3 – running DataNode1
> <host-name> # it is the host-name of your Node4 – running DataNode2
> ….
> <host-name> # it is the host-name of your NodeN – running DataNodeN
>
>
> 5.      If you are not comfortable touching hdfs-site.xml, no problem:
> after you format your NameNode, it will create the dfs/name, dfs/data, etc.
> folder structure in the local Linux default /tmp/hadoop-hdfs/ directory.
> You can later move this to a different location by setting the corresponding
> properties in hdfs-site.xml, but please first learn the file
> structure/permissions/owners of those directories (dfs/data, dfs/name,
> dfs/namesecondary, etc.) that were created for you by default.
>
> Let’s format the HDFS namespace: (note we format it as the hdfs user)
> $ sudo -u hdfs hadoop namenode -format
> NOTE: you only run this command ONCE, and on the NameNode only!
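>
> Once the format has run, you can peek at what it created (the /tmp/hadoop-hdfs
> path is the default mentioned in point 5 and may differ if you changed
> hadoop.tmp.dir); note the owners and permissions:
> $ ls -lR /tmp/hadoop-hdfs/dfs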
>
> I only added the following property to my hdfs-site.xml on the NameNode
> (Node1) for the SecondaryNameNode to use:
>
> <property>
>   <name>dfs.namenode.http-address</name>
>   <value>namenode.host.address:50070</value>   # I changed this to
> 0.0.0.0:50070 for the EC2 environment
>   <description>
>     Needed for running SNN
>     The address and the base port on which the dfs NameNode Web UI will
> listen.
>     If the port is 0, the server will start on a free port.
>   </description>
> </property>
>
> There are other SNN-related properties for hdfs-site.xml as well.
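>
> Two worth knowing about are fs.checkpoint.period and fs.checkpoint.dir (these
> are the pre-Hadoop-2 names; the values below are only illustrative, so
> double-check them against your version's defaults and docs):
>
> <property>
>   <name>fs.checkpoint.period</name>
>   <value>3600</value>   <!-- seconds between SNN checkpoints -->
> </property>
> <property>
>   <name>fs.checkpoint.dir</name>
>   <value>/data/1/dfs/snn</value>   <!-- placeholder path for the SNN checkpoint -->
> </property>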
>
> 6.      Copy your /conf.cluster/ folder to every Node in your cluster: Node2
> (SNN), Node3,4,..N (DNs+TTs). Make sure your conf soft link points to this
> directory (see above).
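>
> One way to push the folder out (the hostnames are placeholders; plain scp works
> just as well, and you may need root on the destination to write under
> /etc/hadoop):
> $ for h in node2 node3 node4; do rsync -av /etc/hadoop/conf.cluster/ $h:/etc/hadoop/conf.cluster/; done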
>
> 7.      Now we are ready to start the daemons:
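>
> If you installed from packages, starting them looks roughly like this (the
> service names are the CDH-style MRv1 ones that match the log paths mentioned
> below; your packaging and version may use different names):
> # on Node1
> $ sudo service hadoop-hdfs-namenode start
> $ sudo service hadoop-0.20-mapreduce-jobtracker start
> # on Node2
> $ sudo service hadoop-hdfs-secondarynamenode start
> # on Node3..N
> $ sudo service hadoop-hdfs-datanode start
> $ sudo service hadoop-0.20-mapreduce-tasktracker start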
>
>         Every time you start a respective daemon, a log report is written.
> This is the FIRST place to look for potential problems.
> Unless you change the property in hadoop-env.sh, found in your
> conf.cluster/ directory, namely "#export HADOOP_LOG_DIR=/foo/bar/whatever"
> (uncomment it and set your own path), the default logs are written on each
> respective Node to:
> NameNode, DataNode, SecondaryNameNode – the "/var/log/hadoop-hdfs/" directory
> JobTracker, TaskTracker – "/var/log/hadoop-mapreduce/" or
> "/var/log/hadoop-0.20-mapreduce/", depending on the version of your MR.
> Each daemon writes to its own file, ending in .log.
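>
> A quick way to watch a daemon's log while it starts (the exact file names
> depend on your packaging, hence the wildcard):
> $ tail -f /var/log/hadoop-hdfs/*datanode*.log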
>
>                 I came across a lot of errors playing with this, as
> follows:
> a.      Error: connection refused
> This is normally caused by your firewall. Try running "sudo
> /etc/init.d/iptables status". I bet it is running. Solution: either allow the
> needed ports (see the iptables sketch after this list) or temporarily turn
> off iptables by running "sudo service iptables stop".
> Try to restart the daemon that was refused the connection and check its
> respective .log file under /var/log/... (DataNode, TaskTracker, etc.) again.
> This solved my problems with connections. You can test a connection by
> running "telnet <ip-address> <port>" against the Node you are trying to
> connect to.
> b.      Binding exception.
> This happens when you try to start a daemon on a machine that is not
> supposed to run it, for example, trying to start the JobTracker on a
> slave machine. This is expected: the JobTracker is already running on your
> MasterNode (Node1), hence the binding exception.
> c.      Java heap size or Java child exceptions were thrown when I ran too
> small an instance on EC2. Increasing it from tiny to small, or from small
> to medium, solved the issue.
> d.      The DataNode running on a slave throws an exception about a DataNode
> id mismatch. This happened when I tried to duplicate an instance on EC2 and,
> as a result, ended up with two different DataNodes with the same id.
> Deleting the /tmp/hadoop-hdfs/dfs/data directory on the replicated instance
> and restarting the DataNode daemon solved this issue.
> Now that you have fixed the above errors and restarted the respective
> daemons, your .log files should be clean of any errors.
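>
> The "allowed ports" alternative from (a) looks roughly like this on an
> iptables-based firewall (8020 is the NameNode RPC port from core-site.xml;
> repeat for the other ports your daemons listen on, e.g. 50070, 50010, 8021):
> $ sudo iptables -I INPUT -p tcp --dport 8020 -j ACCEPT
> $ sudo service iptables save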
>
> Let’s now check that all of our DataNodes 1,2..N (Nodes 3,4,…N) are
> registered with the master NameNode (Node1):
> $ hadoop dfsadmin -printTopology
> It should display all the host addresses you listed in the
> /conf.cluster/slaves file.
>
>
> 8.      Let’s create some structure inside HDFS:
> It is very IMPORTANT to create the HDFS /tmp directory. Create it AFTER HDFS
> is up and running:
> $ sudo -u hdfs hadoop fs -mkdir /tmp
> $ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
>
>  Create MapReduce /var directories (YARN requires different structure)
> sudo -u hdfs hadoop fs -mkdir /var
> sudo -u hdfs hadoop fs -mkdir /var/lib
> sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
> sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
> sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
> sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
> sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
> sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
> sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
>
> Verify the HDFS File Structure
> $ sudo -u hdfs hadoop fs -ls -R /
>
> You should see:
> drwxrwxrwt   - hdfs     supergroup          0 2012-04-19 15:14 /tmp
> drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
> drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
> drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
> drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
> drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
> drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
> drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
>
> Create a Home Directory for each MapReduce User
> Create a home directory for each MapReduce user. It is best to do this on
> the NameNode; for example:
> $ sudo -u hdfs hadoop fs -mkdir  /user/<user>
> $ sudo -u hdfs hadoop fs -chown <user> /user/<user>
> where <user> is the Linux username of each user.
>
>
> p.s. whenever you need to add more Nodes running a DataNode/TaskTracker:
> 1. check whether your firewall (iptables) is running and which ports are allowed
> 2. add the new hostname (found by running "$ hostname -f") to your
> conf.cluster/slaves file on the NameNode (Node1) ONLY!
> 3. start the DataNode + TaskTracker on the newly added Node
> 4. restart the NameNode / JobTracker on Node1
> 5. check that your DataNode registered by running "hadoop dfsadmin
> -printTopology"
> 6. if I am duplicating an instance on EC2 that is currently running a
> DataNode, before I start the above two daemons I make sure I delete the data
> inside the /var/log/hadoop-hdfs, /var/log/hadoop-mapreduce and
> /tmp/hadoop-hdfs folders. Starting the DataNode and TaskTracker daemons will
> recreate the files afresh (a sketch of these commands follows below).
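>
> For item 6, the clean-up plus restart on the cloned instance is roughly this
> (it assumes the default /tmp/hadoop-hdfs data location and the CDH-style
> service names used earlier):
> $ sudo rm -rf /tmp/hadoop-hdfs/* /var/log/hadoop-hdfs/* /var/log/hadoop-0.20-mapreduce/*
> $ sudo service hadoop-hdfs-datanode start
> $ sudo service hadoop-0.20-mapreduce-tasktracker start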
>
> Happy Hadooping.
>