Posted to common-user@hadoop.apache.org by Foss User <fo...@gmail.com> on 2009/04/04 12:47:48 UTC

Newbie questions on Hadoop topology

I was going through the tutorial here.

http://hadoop.apache.org/core/docs/current/cluster_setup.html

Certain things are not clear, so I am asking them point by point. I have a
setup of 4 Linux machines: 1 name node, 1 job tracker, and 2 slaves
(each is a data node as well as a task tracker).

1. Should I edit conf/slaves on all nodes or only on the name node? Do I
have to edit it on the job tracker too?

2. What does the 'bin/hadoop namenode -format' command actually do? I want to
know at the OS level. Does it create some temporary folders on all the
slave data nodes which will be collectively interpreted as HDFS by the
Hadoop framework?

3. Does the 'bin/hadoop namenode -format' command affect name node,
job tracker and task tracker nodes (assuming there is a slave which is
only a task tracker and not a data node)?

4. If I add one more slave (datanode + task tracker) to the
cluster later, what changes do I need to make apart from adding the IP
address of the slave node to conf/slaves? Do I need to restart any
service?

5. When I add a new slave to the cluster later, do I need to run the
namenode -format command again? If I have to, how do I ensure that
existing data is not lost? If I don't have to, how will the folders
necessary for HDFS be created on the new slave machine?

Re: Newbie questions on Hadoop topology

Posted by Todd Lipcon <to...@cloudera.com>.
On Sat, Apr 4, 2009 at 10:25 PM, Foss User <fo...@gmail.com> wrote:

>
> On Sun, Apr 5, 2009 at 10:27 AM, Todd Lipcon <to...@cloudera.com> wrote:
> > On Sat, Apr 4, 2009 at 3:47 AM, Foss User <fo...@gmail.com> wrote:
> >>
> >> 1. Should I edit conf/slaves on all nodes or only on the name node? Do I
> >> have to edit it on the job tracker too?
> >>
> >
> > The conf/slaves file is only used by the start/stop scripts (e.g.
> > start-all.sh). This script is just a handy wrapper that sshes to all of the
> > slaves to start the datanode/tasktrackers on those machines. So, you should
> > edit conf/slaves on whatever machine you tend to run those administrative
> > scripts from, but those are for convenience only and not necessary. You can
> > start the datanode/tasktracker services on the slave nodes manually and it
> > will work just the same.
>
> What are the commands to start the data node and task tracker on a slave
> machine?
>

With the vanilla Hadoop distribution:

    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Or, if you're using the Cloudera Distribution for Hadoop, you should start
them as standard Linux services (e.g. /etc/init.d/hadoop-datanode start).
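
If a daemon doesn't come up, its log is the first place to look. A quick
sketch, assuming the default log location of $HADOOP_HOME/logs (the file
name includes the user and hostname, e.g. hadoop-hadoop-datanode-slave1.log):

    tail -f $HADOOP_HOME/logs/hadoop-*-datanode-*.log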


> >> 5. When I add a new slave to the cluster later, do I need to run the
> >> namenode -format command again? If I have to, how do I ensure that
> >> existing data is not lost? If I don't have to, how will the folders
> >> necessary for HDFS be created on the new slave machine?
> >>
> >
> >
> > No - after starting the slave, the NN and JT will start assigning
> > blocks/jobs to the new slave immediately. The HDFS directories will be
> > created when you start up the datanode - you just need to ensure that the
> > directory configured in dfs.data.dir exists and is writable by the hadoop
> > user.
>
> All these days while I was working, dfs.data.dir was something like
> /tmp/hadoop-hadoop/dfs/data, but this directory never existed beforehand. Only
> /tmp existed, and it was writable by Hadoop. When I started the namenode
> on the master, this directory was created automatically on the master
> as well as on all the slaves.
>

Starting just the namenode won't create the data dirs on the slaves. If you
used the start-dfs.sh script, it sshed into the slaves and started the
datanode on each of them, and that is what created the data dirs.


>
> So, are you correct in saying that the directory configured in
> dfs.data.dir should already exist? Isn't it more that the directory
> configured in dfs.data.dir will be created automatically if it
> doesn't exist? The only thing is that the hadoop user should have
> permission to create it. Am I right?
>


Correct - sorry if I wasn't clear on that. The hadoop user needs to be able
to perform the equivalent of "mkdir -p" on the dfs.data.dir path.

Having the dfs.data.dir in /tmp is a default setting that you should
definitely change, though. /tmp is cleared by a cron job on most systems as
well as at boot.
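
For example, something along these lines in conf/hadoop-site.xml (the
/data/hadoop path is only an illustration -- put it on whatever disk you
actually want to use):

    <property>
      <name>dfs.data.dir</name>
      <value>/data/hadoop/dfs/data</value>
    </property>

and make sure the hadoop user can write there before starting the datanode:

    mkdir -p /data/hadoop/dfs/data
    chown -R hadoop:hadoop /data/hadoop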

-Todd

Re: Newbie questions on Hadoop topology

Posted by Foss User <fo...@gmail.com>.
I have a few more questions on your answers. Please see them inline.

On Sun, Apr 5, 2009 at 10:27 AM, Todd Lipcon <to...@cloudera.com> wrote:
> On Sat, Apr 4, 2009 at 3:47 AM, Foss User <fo...@gmail.com> wrote:
>>
>> 1. Should I edit conf/slaves on all nodes or only on the name node? Do I
>> have to edit it on the job tracker too?
>>
>
> The conf/slaves file is only used by the start/stop scripts (e.g.
> start-all.sh). This script is just a handy wrapper that sshes to all of the
> slaves to start the datanode/tasktrackers on those machines. So, you should
> edit conf/slaves on whatever machine you tend to run those administrative
> scripts from, but those are for convenience only and not necessary. You can
> start the datanode/tasktracker services on the slave nodes manually and it
> will work just the same.

What are the commands to start the data node and task tracker on a slave machine?

>> 5. When I add a new slave to the cluster later, do I need to run the
>> namenode -format command again? If I have to, how do I ensure that
>> existing data is not lost? If I don't have to, how will the folders
>> necessary for HDFS be created on the new slave machine?
>>
>
>
> No - after starting the slave, the NN and JT will start assigning
> blocks/jobs to the new slave immediately. The HDFS directories will be
> created when you start up the datanode - you just need to ensure that the
> directory configured in dfs.data.dir exists and is writable by the hadoop
> user.

All these days while I was working, dfs.data.dir was something like
/tmp/hadoop-hadoop/dfs/data, but this directory never existed beforehand. Only
/tmp existed, and it was writable by Hadoop. When I started the namenode
on the master, this directory was created automatically on the master
as well as on all the slaves.

So, are you correct in saying that the directory configured in
dfs.data.dir should already exist? Isn't it more that the directory
configured in dfs.data.dir will be created automatically if it
doesn't exist? The only thing is that the hadoop user should have
permission to create it. Am I right?

Re: Newbie questions on Hadoop topology

Posted by Todd Lipcon <to...@cloudera.com>.
On Sat, Apr 4, 2009 at 3:47 AM, Foss User <fo...@gmail.com> wrote:

> Certain things are not clear, so I am asking them point by point. I have a
> setup of 4 Linux machines: 1 name node, 1 job tracker, and 2 slaves
> (each is a data node as well as a task tracker).
>

For a cluster of this size, you probably want to run one machine that is
both the NN and JT, and the other 3 as slaves. There's no problem colocating
multiple daemons on the same box as long as it's not overloaded. Given it's
a small cluster, it should be fine.


>
> 1. Should I edit conf/slaves on all nodes or only on the name node? Do I
> have to edit it on the job tracker too?
>

The conf/slaves file is only used by the start/stop scripts (e.g.
start-all.sh). This script is just a handy wrapper that sshes to all of the
slaves to start the datanode/tasktrackers on those machines. So, you should
edit conf/slaves on whatever machine you tend to run those administrative
scripts from, but those are for convenience only and not necessary. You can
start the datanode/tasktracker services on the slave nodes manually and it
will work just the same.
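
For illustration, conf/slaves is just a list of slave hostnames, one per
line (the hostnames below are made up):

    $ cat conf/slaves
    slave1.example.com
    slave2.example.com

    $ bin/start-all.sh    # starts the NN/JT locally, then sshes to each slave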


>
> 2. What does the 'bin/hadoop namenode -format' command actually do? I want to
> know at the OS level. Does it create some temporary folders on all the
> slave data nodes which will be collectively interpreted as HDFS by the
> Hadoop framework?
>

namenode -format is run on the namenode machine and sets up the on-disk
database/storage for the filesystem metadata in dfs.name.dir. The datanodes
maintain their storage automatically and don't need any particular "format"
command to be run - simply list a directory in dfs.data.dir in
hadoop-site.xml, and the datanode will start using it for block storage.
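
Concretely, it is a one-time step on the namenode machine before the very
first start (the file names below are a sketch and vary a bit by version):

    bin/hadoop namenode -format

    # afterwards, dfs.name.dir contains the metadata store, e.g.:
    #   current/fsimage   - the filesystem image
    #   current/edits     - the edit log
    #   current/VERSION   - namespace/storage identifiers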


>
> 3. Does the 'bin/hadoop namenode -format' command affect name node,
> job tracker and task tracker nodes (assuming there is a slave which is
> only a task tracker and not a data node)?
>

See above -- it simply affects the metadata store on the namenode. The
jobtracker and task trackers are unaffected, and technically the datanodes
are unaffected as well. Datanodes will "find out" about the formatting when
they report block locations for files that the namenode no longer knows
about.


>
> 4. If I add one more slave (datanode + task tracker) to the
> cluster later, what changes do I need to make apart from adding the IP
> address of the slave node to conf/slaves? Do I need to restart any
> service?
>

You simply need to start the DN/TT on the new node. Adding it to conf/slaves
only affects the start/stop scripts. The DN and TT will contact the NN/JT
respectively and register themselves in the system.
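
So for a new slave the whole procedure is roughly (a sketch, assuming the
vanilla tarball layout):

    # on the new slave:
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

You can then confirm the new datanode registered from any machine with the
client configured:

    bin/hadoop dfsadmin -report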


>
> 5. When I add a new slave to the cluster later, do I need to run the
> namenode -format command again? If I have to, how do I ensure that
> existing data is not lost? If I don't have to, how will the folders
> necessary for HDFS be created on the new slave machine?
>


No - after starting the slave, the NN and JT will start assigning
blocks/jobs to the new slave immediately. The HDFS directories will be
created when you start up the datanode - you just need to ensure that the
directory configured in dfs.data.dir exists and is writable by the hadoop
user.

Hope that helps

-Todd