Posted to common-user@hadoop.apache.org by Vijay Rao <ra...@gmail.com> on 2010/05/09 08:49:19 UTC

Fundamental question

Hello,

I am just starting to read about and understand Hadoop and all of its
components. However, I have a fundamental question for which I am not finding
answers in any of the online material out there.

1) If Hadoop is used, do all the slaves and other machines in the cluster
need to be formatted with the HDFS file system? If so, what happens to the
terabytes of data that need to be crunched? Or does that data live on a
different machine?

2) Everywhere it is mentioned that the main advantage of map/reduce and
Hadoop is that it runs on data that is available locally. So does this mean
that once the file system is formatted I have to move my terabytes of
data and split them across the cluster?

Thanks
VJ

Re: Fundamental question

Posted by Joseph Stein <cr...@gmail.com>.
1) Only the namenode is "formatted", and what that really does is create and
prep the filesystem image file. The image file holds the metadata
describing how your files are stored on the cluster.

2) The datanodes are not formatted in the conventional sense. Their
disk usage simply grows as data in the cluster grows and they store
blocks. The movement of files is all automatic and depends on your
replication setting (which defaults to 3). HDFS takes your data, breaks
it up into blocks, and replicates those blocks across the nodes:
http://hadoop.apache.org/common/docs/current/hdfs_design.html#Data+Replication
When Map/Reduce jobs run, Hadoop will try to start the tasks on the
datanodes where the data already lives (so as not to waste time moving
it); otherwise it will move the data to nodes that are free to do the
crunching (e.g. idle nodes) to maximize use of the cluster.
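
As a minimal sketch of what "putting data into the cluster" looks like, the
following uses the standard org.apache.hadoop.fs.FileSystem API; the local and
HDFS paths and the class name are hypothetical, and it assumes the cluster's
configuration files are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadIntoHdfs {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name, dfs.replication, etc. from the cluster config.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS. HDFS splits it into blocks and
        // replicates each block across datanodes automatically.
        Path local = new Path("/data/big-input.txt");      // hypothetical local path
        Path inHdfs = new Path("/user/vj/big-input.txt");  // hypothetical HDFS path
        fs.copyFromLocalFile(local, inHdfs);

        // Optionally change the replication factor for this one file.
        fs.setReplication(inHdfs, (short) 3);
      }
    }

The command-line equivalent is simply "hadoop fs -put /data/big-input.txt
/user/vj/"; either way, the splitting and replication happen behind the scenes.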

/*
Joe Stein
http://allthingshadoop.com
*/


Re: Fundamental question

Posted by Bill Habermaas <bi...@habermaas.us>.
These questions are usually answered once you start using the system, but
I'll provide some quick answers.

1. Hadoop uses the local file system at each node to store blocks. The only
part of the system that needs to be formatted is the namenode, which is where
Hadoop keeps track of the logical HDFS filesystem image that contains the
directory structure, the files, and the datanodes where their blocks reside.
A file in HDFS is a sequence of blocks. When the file has a replication
factor (usually 3), each block has 3 exact copies that reside on different
datanodes. This is important to remember for your second question.

2. The notion of processing locally simply means that map/reduce will process
a file at different nodes by reading the blocks that are stored at each of
those nodes. So if you have 3 copies of the same block at different nodes,
the system can pick nodes where it can process those blocks locally. In
order to process the entire file, map/reduce runs parallel tasks that
process the blocks locally at each node. Once you have data in the HDFS
cluster it is not necessary to move things around; the framework does that
transparently. An example might help: say a file has blocks 1, 2, 3, 4 which
are replicated across 3 datanodes (A, B, C). Due to replication there is a
copy of every block residing at each node. When the map/reduce job is started
by the jobtracker, it begins a task at each node: A will process blocks 1 & 2,
B will process block 3, and C will process block 4. All these tasks
run in parallel, so if you are handling a terabyte+ file there is a big
reduction in processing time. Each task writes its map/reduce output to a
specific output directory (in this case 3 files), which can be used as input
to the next map/reduce job.
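
Here is a minimal sketch of how a client can inspect that block-to-datanode
mapping itself, assuming the hypothetical file path /user/vj/big-input.txt
used earlier in the thread; this is the same placement information the
jobtracker consults when it schedules map tasks on nodes that already hold a
replica of a block:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockHosts {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/vj/big-input.txt");  // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        // One entry per block, each listing the datanodes that hold its replicas.
        for (int i = 0; i < blocks.length; i++) {
          System.out.println("block " + i + " @ offset " + blocks[i].getOffset()
              + " -> " + Arrays.toString(blocks[i].getHosts()));
        }
      }
    }

With a replication factor of 3 on the A/B/C cluster in the example above, each
line would list all three hosts, and the scheduler can run the task on
whichever of them has a free slot.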

I hope this brief answer is helpful and provides some insight.

Bill


