Posted to common-dev@hadoop.apache.org by Vamc <kr...@gmail.com> on 2010/05/17 13:17:52 UTC

Basic Hadoop Doubt

Hi All,

Vamc here, a beginner in Hadoop.

I have a basic doubt about how Hadoop places input data.

Say I give some 30 GB of data as input to a Hadoop program. It will place the
30 GB into HDFS as some set of files, based on some input format.

I have two doubts here:

1. Is the 30 GB placed into HDFS each time I run the program, or how does it
work?
2. If I then want to run some other program on another 100 GB of data, where
both the data and the program differ from the above, is the previous 30 GB
erased from HDFS, or how does it run?


Please respond.

Thanks in advance,

vamsi



Re: Basic Hadoop Doubt

Posted by Hemanth Yamijala <yh...@gmail.com>.
Vamsi,

>
> I have a basic doubt about how Hadoop places input data.
>
> Say I give some 30 GB of data as input to a Hadoop program. It will place the
> 30 GB into HDFS as some set of files, based on some input format.

Conceptually, it would be more accurate to say that HDFS splits the data
into 'blocks' and manages those blocks. Of course,
implementation-wise, these blocks do get stored as physical files on
the datanodes.
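
As a rough illustration of what "splitting into blocks" means for the 30 GB
in question, here is a small arithmetic sketch. It assumes a 64 MB block size
and a replication factor of 3; both are configurable, neither is stated in
this thread, and the function names are made up for the example. It also
treats the 30 GB as one file, whereas in practice each file is split
separately.

```python
MB = 1024 * 1024
GB = 1024 * MB

def block_count(file_bytes, block_bytes=64 * MB):
    """Blocks needed for one file (assumed 64 MB block size)."""
    # Integer ceiling division: a partly filled last block still counts as a block.
    return (file_bytes + block_bytes - 1) // block_bytes

def replica_count(file_bytes, block_bytes=64 * MB, replication=3):
    """Block copies stored across the datanodes (assumed replication factor 3)."""
    return block_count(file_bytes, block_bytes) * replication

print(block_count(30 * GB))    # 480
print(replica_count(30 * GB))  # 1440
```

Under the same assumptions, the 100 GB data set from the second question
would come to 1600 blocks.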

>
> I have two doubts here:
>
> 1. Is the 30 GB placed into HDFS each time I run the program, or how does it
> work?

Which program? Are you talking about this 30 GB as input to the program
or output from it? Assuming it is Map/Reduce input, the answer is, in
general, no. A typical M/R program takes an input path on DFS, and this
can point to data that was already copied to DFS, independently of the
program itself.
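
The idea of copying once and then pointing any number of programs at the
same path can be sketched with a toy model. This is not Hadoop code: ToyDfs,
the two job functions, and the path /data/logs are all hypothetical stand-ins
chosen for the example.

```python
class ToyDfs:
    """In-memory stand-in for a distributed file system (hypothetical, not a Hadoop API)."""

    def __init__(self):
        self.files = {}

    def put(self, path, data):
        # Copy data into the file system under the given path.
        self.files[path] = data

    def read(self, path):
        return self.files[path]

    def used_bytes(self):
        return sum(len(data) for data in self.files.values())

def word_count_job(dfs, input_path):
    """A 'program' that only reads from its input path."""
    return len(dfs.read(input_path).split())

def line_count_job(dfs, input_path):
    """A different 'program' reading the same input path."""
    return len(dfs.read(input_path).splitlines())

dfs = ToyDfs()
dfs.put("/data/logs", "a b c\nd e\n")   # copied once, before any job runs

before = dfs.used_bytes()
words = word_count_job(dfs, "/data/logs")
lines = line_count_job(dfs, "/data/logs")
after = dfs.used_bytes()

print(words, lines, before == after)
```

Both jobs read the same stored data, and the space used does not change
between runs because neither job copies its input again.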

> 2. If I then want to run some other program on another 100 GB of data, where
> both the data and the program differ from the above, is the previous 30 GB
> erased from HDFS, or how does it run?
>

Given that the program and its input are independent, the program will
not modify any existing data. In fact, most Map/Reduce applications do
not overwrite existing output data either. Rather, they refuse to start
if the output directory already exists.
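
A minimal sketch of that "refuse to start" behaviour, written against the
local file system rather than HDFS. The run_job function, the file names,
and the word-count logic are hypothetical; the sketch only mirrors the kind
of check described above and is not Hadoop's implementation.

```python
import os
import tempfile

def run_job(input_path, output_dir):
    """Hypothetical job runner: fails before doing any work if output_dir exists."""
    if os.path.exists(output_dir):
        raise FileExistsError(f"Output directory {output_dir} already exists")
    os.makedirs(output_dir)
    # Read the input without modifying it, and write a result into the new directory.
    with open(input_path) as src:
        word_total = len(src.read().split())
    with open(os.path.join(output_dir, "part-00000"), "w") as dst:
        dst.write(str(word_total))

workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
with open(input_path, "w") as f:
    f.write("one two three\n")

output_dir = os.path.join(workdir, "out")
run_job(input_path, output_dir)          # first run succeeds

try:
    run_job(input_path, output_dir)      # second run hits the existing directory
    second_run_started = True
except FileExistsError:
    second_run_started = False

print(second_run_started)
```

To rerun such a job, one would either pick a new output directory or remove
the old one first; the input file is left untouched in both cases.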

Thanks
Hemanth