Posted to general@hadoop.apache.org by Vijay Rao <ra...@gmail.com> on 2010/05/09 07:42:24 UTC

Fundamental question

Hello,

I am just reading about and getting to understand Hadoop and all the related
components. However, I have a fundamental question for which I am not
finding answers in any of the online material out there.

1) If Hadoop is used, do all the slaves and other machines in the cluster
need to be formatted with the HDFS file system? If so, what happens to the
terabytes of data that need to be crunched? Or is the data on a different
machine?

2) Everywhere it is mentioned that the main advantage of MapReduce and
Hadoop is that it runs on data that is available locally. So does this mean
that once the file system is formatted, I have to move my terabytes of data
and split it across the cluster?

Thanks
VJ

Re: Fundamental question

Posted by Tim Robertson <ti...@gmail.com>.
Hi VJ

> 1) If Hadoop is used, do all the slaves and other machines in the cluster
> need to be formatted with the HDFS file system? If so, what happens to the
> terabytes of data that need to be crunched? Or is the data on a different
> machine?
>

You actually assign specific directories on each machine to be the
directories Hadoop uses for the DFS, so the machines can also hold other
data; you don't format the whole drives. For example, I have
/mnt/disk1/hadoop and /mnt/disk2/hadoop on each of my DataNodes for HDFS to
use. My machines happen to be dedicated to Hadoop, so they don't store
anything else.
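
For concreteness, here is a rough sketch of how that is configured in
hdfs-site.xml (property names are from the 0.20 line, and the paths are
just my example directories, so adjust to taste):

  <configuration>
    <!-- comma-separated list of local directories where the DataNode stores blocks -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/hadoop,/mnt/disk2/hadoop</value>
    </property>
    <!-- number of copies HDFS keeps of each block (3 is the default) -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
  </configuration>

Also note that "formatting" in the HDFS sense just means running
"hadoop namenode -format" once on the NameNode to initialise its metadata;
it doesn't reformat the operating system drives on the slaves.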

> 2) Everywhere it is mentioned that the main advantage of MapReduce and
> Hadoop is that it runs on data that is available locally. So does this mean
> that once the file system is formatted, I have to move my terabytes of
> data and split it across the cluster?
>

Once you copy data into HDFS, you *might* then consider removing it from
the local drives. I think it is more common, though, to dedicate a cluster
to Hadoop and copy data into the DFS from external locations (i.e. the data
doesn't also sit on local drives inside the Hadoop cluster). This is how we
use it, anyway. When you launch an MR job, it knows where the data chunks
are located and runs the processing on the machines in the cluster that
hold those chunks. Remember that HDFS stores redundant copies: you might
copy in a 200GB file, and it gets split up and spread around the cluster,
with each chunk saved perhaps 3 times. Then, when it needs to be processed,
there are 3 machines with any given chunk stored locally, and Hadoop will
try to schedule the tasks needed to complete the job on machines that
already have the data, to minimise copying around.
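
As a rough illustration (the paths here are made up), copying a file in and
then checking how it was split and replicated looks something like:

  # copy a local file into HDFS
  hadoop fs -put /local/data/big-input.csv /user/vj/input/

  # list the blocks the file was split into and which DataNodes hold them
  hadoop fsck /user/vj/input/big-input.csv -files -blocks -locations

The fsck report shows each block and the machines holding its replicas,
which is essentially the information the JobTracker uses when it tries to
schedule map tasks on nodes that already have the data.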

Since you seem interested in the best setup for MapReduce, you might get
better responses on the mapreduce-user mailing list.

Hope this helps,
Tim


> Thanks
> VJ
>