Posted to user@hadoop.apache.org by Bill Q <bi...@gmail.com> on 2013/03/07 05:21:28 UTC

HDFS network traffic

Hi All,
I am working on converting a sequence file to mapfile and just discovered
something I wasn't aware of.

For example, suppose I am working on a 2-node cluster: one
master/namenode/datanode and one slave/datanode. When I ran hadoop dfs -cp
/data/file1 /data/file2 (a 1G file) from the master and monitored the NICs
of both nodes, I saw that the master node sent the entire 1G file to the
slave. This surprised me. Does this mean all the traffic has to go
through the client node that runs the command (in this case, the master)
when I do hadoop dfs -cp?

Many thanks.


Bill

Re: HDFS network traffic

Posted by Harsh J <ha...@cloudera.com>.
Yes, the simple copy is a client operation. The client reads bytes from
the source and writes them to the destination, so it stays in control of
failure handling, etc. However, if you want your cluster to do the copy
(and the data to copy is large), consider using the DistCp
(distributed-copy) MR job to do it.
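
To make "client operation" concrete, here is a minimal sketch of a copy in
which every byte is relayed through the process that runs the command. It
does not use the Hadoop API: the in-memory streams are stand-ins for
/data/file1 and /data/file2, and the function name and chunk size are
assumptions chosen for illustration only.

```python
import io


def client_side_copy(src, dst, chunk_size=64 * 1024):
    """Relay bytes from src to dst through this process.

    Returns the number of bytes that passed through the client.
    """
    relayed = 0
    while True:
        chunk = src.read(chunk_size)  # client pulls a chunk from the source
        if not chunk:                 # empty read means end of stream
            break
        dst.write(chunk)              # client pushes the chunk to the destination
        relayed += len(chunk)
    return relayed


# In-memory stand-ins for the source and destination files (hypothetical).
source = io.BytesIO(b"x" * 1_000_000)
dest = io.BytesIO()

relayed = client_side_copy(source, dest)
print(relayed)
```

The byte count returned equals the file size, which matches the observation
in the question: the node running the copy handles the whole file. With
DistCp (typically invoked as hadoop distcp <src> <dst>), the copying is
done by map tasks running on the cluster nodes instead of by the shell
client.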

On Thu, Mar 7, 2013 at 9:51 AM, Bill Q <bi...@gmail.com> wrote:
> Hi All,
> I am working on converting a sequence file to mapfile and just discovered
> something I wasn't aware of.
>
> For example, suppose I am working on a 2-node cluster, one
> master/namenode/datanode, one slave/datanode. If I do hadoop dfs -cp
> /data/file1 /data/file2 (a 1G file) from the master, and monitor the NIC of
> both nodes, I saw that the master node send the entire file of 1G traffic to
> the slave. This surprised me. Does this mean all the traffic has to go
> through the client node that runs the command (in this case, the master)
> when I do hadoop dfs -cp?
>
> Many thanks.
>
>
> Bill



--
Harsh J
