Posted to common-user@hadoop.apache.org by rajeev gupta <gr...@yahoo.com> on 2009/06/18 12:13:17 UTC

Data replication and moving computation

I have a question about HDFS. Suppose I have 3 machines in my HDFS cluster and the replication factor is 1. A large file sits on the local file system of one of those three machines. If I put that file into HDFS, will it be divided and distributed across all three machines? I ask because of the HDFS principle that "moving computation is cheaper than moving data".

If the file is distributed across all three machines, there will be a lot of data transfer, whereas if the file is NOT distributed, the compute power of the other machines will go unused. Am I missing something here?

-Raj

Re: Data replication and moving computation

Posted by Roshan James <ro...@gmail.com>.

Further, look at the namenode file system browser for your cluster to see
the chunking in action.

http://wiki.apache.org/hadoop/WebApp%20URLs

Roshan

On Thu, Jun 18, 2009 at 6:28 AM, Harish Mallipeddi <harish.mallipeddi@gmail.com> wrote:

> On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta <gr...@yahoo.com> wrote:
>
> >
> > I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS
> > cluster and replication factor is 1. A large file is there on one of those
> > three cluster machines in its local file system. If I put that file in HDFS
> > will it be divided and distributed across all three machines? I had this
> > doubt as HDFS "moving computation is cheaper than moving data".
> >
> > If file is distributed across all three machines, lots of data transfer
> > will be there, whereas, if file is NOT distributed then compute power of
> > other machine will be unused. Am I missing something here?
> >
> > -Raj
> >
> >
> >
> Irrespective of what you set as the replication factor, large files will
> always be split into chunks (chunk size is what you set as your HDFS
> block-size) and they'll be distributed across your entire cluster.
>
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>

Re: Data replication and moving computation

Posted by Harish Mallipeddi <ha...@gmail.com>.

On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta <gr...@yahoo.com> wrote:

>
> I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS
> cluster and replication factor is 1. A large file is there on one of those
> three cluster machines in its local file system. If I put that file in HDFS
> will it be divided and distributed across all three machines? I had this
> doubt as HDFS "moving computation is cheaper than moving data".
>
> If file is distributed across all three machines, lots of data transfer
> will be there, whereas, if file is NOT distributed then compute power of
> other machine will be unused. Am I missing something here?
>
> -Raj
>
>
>
Irrespective of what you set as the replication factor, large files will
always be split into chunks (chunk size is what you set as your HDFS
block-size) and they'll be distributed across your entire cluster.


-- 
Harish Mallipeddi
http://blog.poundbang.in
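
As a rough illustration of the chunking described in the replies above, here is a minimal sketch of how a file's byte range breaks into block-sized pieces. It is plain arithmetic, not Hadoop code: the function name split_into_blocks is hypothetical, and the 200 MB file size and 64 MB block size are assumed example values (use whatever block size your cluster is configured with). It does not model which machine each block lands on, since the cluster decides that.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block.

    Hypothetical helper for illustration only; it mirrors the idea that a
    file is cut into block_size pieces, with a shorter final piece.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds whatever bytes remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Assumed example values: a 200 MB file and a 64 MB block size.
blocks = split_into_blocks(200 * MB, 64 * MB)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

With these assumed numbers the file becomes three full 64 MB blocks plus one 8 MB block. The replication factor does not appear anywhere in the calculation, which matches the point made above that splitting happens regardless of it.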