Posted to user@hadoop.apache.org by "Kartashov, Andy" <An...@mpac.ca> on 2012/11/02 17:35:35 UTC

Hadoop - cluster set-up (for DUMMIES)... or how I did it

Hello Hadoopers,

After weeks of struggle and numerous rounds of error debugging, I finally managed to set up a fully distributed cluster, and I decided to share my experience with newcomers.
In case the experts on here disagree with some of the facts mentioned herein, feel free to correct them or add your comments.

Example Cluster Topology:
Node 1 – NameNode+JobTracker
Node 2 – SecondaryNameNode
Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N

Configuration set-up after you installed Hadoop:

First, you will need to find the host address of each respective Node by running:
$ hostname -f

Your /etc/hadoop/ folder contains subfolders holding your configuration files. Your installation will create a default folder, conf.empty. Copy it to, say, conf.cluster and make sure your conf soft link points to conf.cluster.
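For example, assuming the default layout under /etc/hadoop/, the copy itself can be done with something like:
$ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.cluster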

You can see what it currently points to by running:
$ alternatives --display hadoop-conf

Make a new link and set it to point to conf.cluster:
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
Run the display command again to verify the configuration:
$ alternatives --display hadoop-conf

Let's go inside conf.cluster:
$ cd conf.cluster/

As a minimum, we will need to modify the following files:
1.      core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<host-name>:8020/</value>  <!-- the host-name of your NameNode (Node 1), which you found with "hostname -f" above -->
</property>

2.      mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <value><host-name>:8021</value>  <!-- host-name of your NameNode (Node 1) again, since we intend to run the NameNode and JobTracker on the same machine; e.g. ip-10-62-62-235.ec2.internal:8021. Note this is a plain host:port, not an hdfs:// URI. -->
  </property>

3.      masters # if this file doesn’t exist yet, create it and add one line:
<host-name> # it is the host-name of your Node2 – running SecondaryNameNode

4.      slaves # if this file doesn't exist yet, create it and add your host-names (one per line):
<host-name> # it is the host-name of your Node3 – running DataNode1
<host-name> # it is the host-name of your Node4 – running DataNode2
….
<host-name> # it is the host-name of your NodeN – running DataNodeN


5.      If you are not comfortable touching hdfs-site.xml, no problem: after you format your NameNode, it will create the dfs/name, dfs/data, etc. folder structure under the local Linux default /tmp/hadoop-hdfs/ directory. You can later move this to a different location by setting it in hdfs-site.xml, but please first learn the file structure/permissions/owners of the directories (dfs/data, dfs/name, dfs/namesecondary, etc.) that were created for you by default.
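If you later decide to move the storage out of /tmp (as the reply at the bottom of this thread also recommends), a minimal hdfs-site.xml sketch might look like the following. The /data/1/... paths are only placeholders for directories you create and chown to hdfs yourself; on newer releases the equivalent property names are dfs.namenode.name.dir and dfs.datanode.data.dir:

<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn</value>  <!-- placeholder path for NameNode metadata (Node 1) -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn</value>  <!-- placeholder path for DataNode blocks (Nodes 3..N) -->
</property>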

Let's format the HDFS namespace (note that we format it as the hdfs user):
$ sudo -u hdfs hadoop namenode -format
NOTE: you run this command ONCE, and on the NameNode only!

I only added the following property to my hdfs-site.xml on the NameNode (Node 1) for the SecondaryNameNode to use:

<property>
  <name>dfs.namenode.http-address</name>
  <value>namenode.host.address:50070</value>  <!-- I changed this to 0.0.0.0:50070 for the EC2 environment -->
  <description>
    Needed for running the SNN.
    The address and the base port on which the dfs NameNode Web UI will listen.
    If the port is 0, the server will start on a free port.
  </description>
</property>
There are other SNN-related properties you can set in hdfs-site.xml.
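For reference, two checkpoint-related settings you might also tune, sketched here with placeholder values (property names as in CDH4-era Hadoop; older releases call them fs.checkpoint.period and fs.checkpoint.dir, so check your version's docs):

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>  <!-- seconds between SNN checkpoints; 3600 is the usual default -->
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/data/1/dfs/snn</value>  <!-- placeholder path on Node 2 for checkpoint storage -->
</property>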

6.      Copy your conf.cluster/ folder to every Node in your cluster: Node 2 (SNN) and Nodes 3,4,..,N (DNs+TTs). Make sure the conf soft link on each node points to this directory (see above).
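One way to push it out (rsync or scp both work; <node-host> is a placeholder for each host found with "hostname -f"):
$ scp -r /etc/hadoop/conf.cluster <node-host>:/etc/hadoop/
Then repeat the alternatives commands from above on each node so that conf points at conf.cluster.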

7.      Now we are ready to start the daemons:
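On a CDH-style package install, the init scripts look roughly like this; the service names vary with your distribution and MR version, so treat this as a sketch and adjust to your install:
On Node 1 (NN+JT):
$ sudo service hadoop-hdfs-namenode start
$ sudo service hadoop-0.20-mapreduce-jobtracker start
On Node 2 (SNN):
$ sudo service hadoop-hdfs-secondarynamenode start
On Nodes 3..N (DN+TT):
$ sudo service hadoop-hdfs-datanode start
$ sudo service hadoop-0.20-mapreduce-tasktracker start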

Every time you start a daemon, a log is written. This is the FIRST place to look for potential problems.
Unless you change the HADOOP_LOG_DIR setting in hadoop-env.sh, found in your conf.cluster/ directory (uncomment and edit "#export HADOOP_LOG_DIR=/foo/bar/whatever"), the default logs are written on each respective Node to:
NameNode, DataNode, SecondaryNameNode – the "/var/log/hadoop-hdfs/" directory
JobTracker, TaskTracker – "/var/log/hadoop-mapreduce/" or "/var/log/hadoop-0.20-mapreduce/", etc., depending on the version of your MR. Each daemon writes to a file whose name ends in .log.
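For example, to watch the NameNode log on Node 1 (the exact filename pattern depends on your install; on my CDH-style setup it looked roughly like hadoop-hdfs-namenode-<hostname>.log, where <hostname> is a placeholder):
$ tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log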

I came across a lot of errors while playing with this, as follows:
a.      Error: connection refused
This is normally caused by your firewall. Try running "sudo /etc/init.d/iptables status"; I bet it is running. Solution: either add the allowed ports or temporarily turn off iptables by running "sudo service iptables stop".
Then restart the daemon that was refused the connection and check its respective /var/log/... DataNode, TaskTracker, etc. .log file again.
This solved my connection problems. You can test a connection by running "telnet <ip-address> <port>" against the Node you are trying to connect to.
b.      Binding exception.
This happens when you try to start a daemon on a machine that is not supposed to run it, for example trying to start the JobTracker on a slave machine. This is a given: the JobTracker is already running on your master Node 1, hence the binding exception.
c.      Java heap size or Java child exceptions were thrown when I ran too small an instance on EC2. Increasing the instance size (e.g. from micro to small, or from small to medium) solved the issue.
d.      A DataNode running on a slave throws an exception about a DataNode ID mismatch. This happened when I duplicated an instance on EC2 and, as a result, ended up with two different DataNodes sharing the same ID. Deleting the /tmp/hadoop-hdfs/dfs/data directory on the replicated instance and restarting the DataNode daemon solved this issue.
Now that you have fixed the above errors and restarted the respective daemons, your .log files should be clean of any errors.

Let's now check that all of our DataNodes 1,2,..,N (Nodes 3,4,..,N) are registered with the master NameNode (Node 1):
$ hadoop dfsadmin -printTopology
This should display all the host addresses you listed in the conf.cluster/slaves file.


8.      Let's create some structure inside HDFS.
It is VERY IMPORTANT to create the HDFS /tmp directory, and to create it AFTER HDFS is up and running:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

 Create the MapReduce /var directories (YARN requires a different structure):
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
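If your version of the fs shell supports the -p flag (newer releases do), the chain of mkdir commands above can be collapsed into a single command before the chmod/chown steps:
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging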

Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /

You should see:
drwxrwxrwt   - hdfs supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Create a Home Directory for each MapReduce User
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
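For instance, for a hypothetical Linux user called joe:
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe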


p.s. Whenever you need to add more Nodes running a DataNode/TaskTracker:
1. Check whether your firewall (iptables) is running and which ports are allowed.
2. Add the new node's hostname (from "hostname -f") to conf.cluster/slaves on Node 1 (NameNode) ONLY!
3. Start the DataNode + TaskTracker on the newly added Node.
4. Restart the NameNode / JobTracker on Node 1.
5. Check that your DataNode registered by running "hadoop dfsadmin -printTopology".
6. If I am duplicating an instance on EC2 that is currently running a DataNode, then before starting the above two daemons I make sure I delete the data inside the /var/log/hadoop-hdfs, /var/log/hadoop-mapreduce and /tmp/hadoop-hdfs folders. Starting the DataNode and TaskTracker daemons will recreate these files afresh.

Happy Hadooping.

Re: Hadoop - cluster set-up (for DUMMIES)... or how I did it

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Andy,

        Thank you for sharing your experience with us. I would just like
to add that it is always good to include the "dfs.name.dir" and "dfs.data.dir"
properties in the hdfs-site.xml file to make sure that everything runs smoothly,
as /tmp gets emptied at each restart, so there is always a chance of
losing the data and meta info. Also, it's good to add "hadoop.tmp.dir" in
core-site.xml as it also defaults to /tmp.

Regards,
    Mohammad Tariq



