Posted to common-user@hadoop.apache.org by KayVajj <va...@gmail.com> on 2013/04/11 00:20:50 UTC

Copy Vs DistCP

I have a few questions regarding the usage of DistCP for copying files
within the same cluster.


1) Which one is better within the same cluster, and what factors (like file
size etc.) would influence the choice of one over the other?

2) When we run a cp command like the one below from a client node of the
cluster (not a data node), how does the cp command work:
     i) like an MR job, or
    ii) by copying the files locally and then copying them back to the new
        location?

Example of the copy command

hdfs dfs -cp /<some_location>/file /<new_location>/

Thanks, your responses are appreciated.

-- Kay

Re: Copy Vs DistCP

Posted by Jay Vyas <ja...@gmail.com>.

DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
"fully" parallel copy to the destination).

CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
issue a copy command for every source file.

I have an additional question: how is CP, which is internal to a cluster,
optimized (if at all)?
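
For reference, a DistCP run within a single cluster might be launched roughly
as sketched below. This is only an illustration: the helper name distcp_cmd,
the paths, and the default of 20 maps are made up here, and the function just
prints the command so it can be inspected without a cluster. The -m option is
DistCP's cap on the number of map tasks (simultaneous copies).

```shell
#!/bin/sh
# Build (but do not run) a DistCP command line.
# distcp_cmd is a hypothetical helper; the paths are placeholders.
distcp_cmd() {
    src=$1
    dst=$2
    maps=${3:-20}   # assumed default cap on map tasks, chosen arbitrarily
    printf 'hadoop distcp -m %s %s %s\n' "$maps" "$src" "$dst"
}

# Print the command that would copy /data/in to /data/out with 10 maps.
distcp_cmd /data/in /data/out 10
```

Each map task copies a share of the source files, which is where the
parallelism comes from.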

-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Copy Vs DistCP

Posted by Azuryy Yu <az...@gmail.com>.

Yes, you are right.


On Thu, Apr 11, 2013 at 3:40 PM, Hemanth Yamijala <yhemanth@thoughtworks.com> wrote:

> AFAIK, the cp command works entirely from the DFS client. It reads bytes from
> the InputStream created when the source file is opened and writes them to the
> OutputStream of the destination file. It does not work at the level of data
> blocks. The configuration property io.file.buffer.size sets the size of the
> buffer used in the copy; it is 4096 bytes by default.
>
> Thanks
> Hemanth
>
>
> On Thu, Apr 11, 2013 at 9:42 AM, KayVajj <va...@gmail.com> wrote:
>
>> If the CP command is not parallel, how does it work for a file partitioned
>> across various data nodes?
>>
>>
>> On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu <az...@gmail.com> wrote:
>>
>>> The CP command is not parallel. It just calls FileSystem, even if DFSClient
>>> is multi-threaded.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>
>>>> The File System Copy utility copies files byte by byte, if I'm not
>>>> wrong. Could the cp command work with blocks and move them instead,
>>>> which could be significantly more efficient?
>>>>
>>>>
>>>> Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>
>>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and DistCP
>>>>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>>>>> will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Copy Vs DistCP

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.

AFAIK, the cp command works entirely from the DFS client. It reads bytes from
the InputStream created when the source file is opened and writes them to the
OutputStream of the destination file. It does not work at the level of data
blocks. The configuration property io.file.buffer.size sets the size of the
buffer used in the copy; it is 4096 bytes by default.
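
The copy path described above is, in essence, a single-process read/write
loop over a fixed-size buffer. As a loose local analogy only (this is not
Hadoop code, and buffered_copy is a made-up name), dd with a 4096-byte block
size behaves similarly on ordinary files, with local temporary files standing
in for HDFS paths:

```shell
#!/bin/sh
# Copy src to dst through a fixed-size buffer, one block at a time,
# in a single process: an analogy for the client-side stream copy.
buffered_copy() {
    src=$1
    dst=$2
    bufsize=${3:-4096}   # mirrors the io.file.buffer.size default quoted above
    dd if="$src" of="$dst" bs="$bufsize" 2>/dev/null
}

# Example with local temporary files standing in for HDFS paths.
tmpdir=$(mktemp -d)
printf 'hello hdfs\n' > "$tmpdir/src"
buffered_copy "$tmpdir/src" "$tmpdir/dst"
cmp "$tmpdir/src" "$tmpdir/dst" && echo "copy matches"
rm -rf "$tmpdir"
```

Because every byte passes through the one client process, the copy is not
parallel, whichever data nodes hold the file's blocks.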

Thanks
Hemanth


Re: Copy Vs DistCP

Posted by KayVajj <va...@gmail.com>.

If the CP command is not parallel, how does it work for a file partitioned
across various data nodes?


Re: Copy Vs DistCP

Posted by Azuryy Yu <az...@gmail.com>.

DistCP is preferred for your requirements.


On Fri, Apr 12, 2013 at 12:52 AM, KayVajj <va...@gmail.com> wrote:

> Summing up, what would be the recommendations for copying?
>
> 1) DistCP
> 2) The shell cp command
> 3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
> 4) An MR job with an identity mapper and no reducer (maybe this is what
> DistCP does)
>
>
> I did not run any comparisons, as my dev cluster is just a two-node cluster
> and I am not sure how this would perform on a production cluster.
>
> Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>
>> Yes, makes sense... cp is serialized and simpler, and does not rely on the
>> JobTracker, whereas distcp only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>> I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>> wrote:
>>
>> If the cluster is busy with other jobs, distcp will wait for free map slots.
>> Regular cp is more reliable and predictable, especially if you need to copy
>> just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>
>>> CP command is not parallel, It's just call FileSystem, even if DFSClient
>>> has multi threads.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>
>>>> The File System Copy utility copies files byte by byte if I'm not
>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>> them which could be significantly efficient?
>>>>
>>>>
>>>> Also how does the cp command work if the file is distributed on
>>>> different data nodes??
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>
>>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and distCP
>>>>>> between clusters. The cp command is a Hadoop-internal parallel process
>>>>>> and will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>       I have a few questions regarding the usage of DistCP for copying
>>>>>> files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of
>>>>>> the cluster (not a data node), how does the cp command work?
>>>>>>       i) like an MR job
>>>>>>      ii) copy the files locally and then copy them back to the new
>>>>>> location.
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>

Re: Copy Vs DistCP

Posted by Amal G Jose <am...@gmail.com>.

For copying large files, I prefer distcp.


On Sun, Apr 14, 2013 at 11:31 PM, Ted Dunning <td...@maprtech.com> wrote:

>
>
>
> On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts <
> mathias.herberts@gmail.com> wrote:
>
>>
>> >
>> > This is absolutely true.  Distcp dominates cp for large copies.  On the
>> > other hand cp dominates distcp for convenience.
>> >
>> > In my own experience, I love cp when copying relatively small amounts
>> > of data (10's of GB) where the available bandwidth of about a GB/s allows
>> > the copy to complete in less time than it takes distcp to get started.
>> >
>> > At larger sizes (100's of GB and up), the startup time of distcp
>> > doesn't matter because once it gets going, it moves data much faster.
>>
>> Maybe we could put together a 'fs -smartcp' which chooses wisely between
>> copy and distcp depending on file size.
>>
>
> Uh... hmm...
>
> This is a good suggestion.  Obvious in fact.  In retrospect.
>
> I would also suggest that the new command be called "distcp".
>
>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

>
> >
> > This is absolutely true.  Distcp dominates cp for large copies.  On the
> > other hand cp dominates distcp for convenience.
> >
> > In my own experience, I love cp when copying relatively small amounts of
> > data (10's of GB) where the available bandwidth of about a GB/s allows the
> > copy to complete in less time than it takes distcp to get started.
> >
> > At larger sizes (100's of GB and up), the startup time of distcp doesn't
> > matter because once it gets going, it moves data much faster.
>
> Maybe we could put together a 'fs -smartcp' which chooses wisely between
> copy and distcp depending on file size.
>

Uh... hmm...

This is a good suggestion.  Obvious in fact.  In retrospect.

I would also suggest that the new command be called "distcp".

Re: Copy Vs DistCP

Posted by Mathias Herberts <ma...@gmail.com>.

>
> This is absolutely true.  Distcp dominates cp for large copies.  On the
> other hand cp dominates distcp for convenience.
>
> In my own experience, I love cp when copying relatively small amounts of
> data (10's of GB) where the available bandwidth of about a GB/s allows the
> copy to complete in less time than it takes distcp to get started.
>
> At larger sizes (100's of GB and up), the startup time of distcp doesn't
> matter because once it gets going, it moves data much faster.

Maybe we could put together a 'fs -smartcp' which chooses wisely between
copy and distcp depending on file size.
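
A minimal sketch of what such a wrapper might look like, assuming a simple
size cutoff. The names smartcp and choose_copy_tool and the 50 GiB threshold
are hypothetical and purely illustrative; only 'hdfs dfs -cp',
'hdfs dfs -du -s' and 'hadoop distcp' are existing commands.

```shell
#!/bin/sh
# Hypothetical "smartcp" helper: picks between 'hdfs dfs -cp' and
# 'hadoop distcp' based on the total size of the source.
# The 50 GiB cutoff is an assumption, not a measured value; tune per cluster.
SMARTCP_THRESHOLD_BYTES=${SMARTCP_THRESHOLD_BYTES:-53687091200}

# Print "cp" or "distcp" for a given size in bytes.
choose_copy_tool() {
    if [ "$1" -lt "$SMARTCP_THRESHOLD_BYTES" ]; then
        echo cp
    else
        echo distcp
    fi
}

# Copy src to dst with whichever tool choose_copy_tool selects.
# The body runs in a subshell so its variables stay local.
smartcp() (
    src=$1
    dst=$2
    # 'hdfs dfs -du -s' prints the total size in bytes as its first field.
    size=$(hdfs dfs -du -s "$src" | awk '{print $1}')
    if [ "$(choose_copy_tool "$size")" = cp ]; then
        hdfs dfs -cp "$src" "$dst"
    else
        hadoop distcp "$src" "$dst"
    fi
)
```

Usage would be something like 'smartcp /some/src /some/dst'. A real version
would probably also want to look at file count and current cluster load, not
just total bytes.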

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Inline


On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> That was a hidden shameless plug Ted ;-)
>

Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.


> The main disadvantage of fs -cp is that all data has to transit via the
> machine you issue the command on; depending on the size of the data you want
> to copy, that can be a killer. DistCp is distributed, as its name implies, so
> it has no bottleneck of this kind.
>

This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand cp dominates distcp for convenience.

In my own experience, I love cp when copying relatively small amounts of
data (10's of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less time than it takes distcp to get started.

At larger sizes (100's of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.
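
As a rough back-of-the-envelope illustration of that trade-off, here is a
small sketch. Only the "about a GB/s" figure for cp comes from the text above;
the 30 second distcp startup delay and the 4 GB/s aggregate rate are assumed
values chosen for the arithmetic, not measurements.

```shell
#!/bin/sh
# Back-of-the-envelope model of cp vs distcp copy time.
CP_GBPS=1             # single-client throughput in GB/s (figure from the thread)
DISTCP_GBPS=4         # aggregate throughput across mappers in GB/s (assumed)
DISTCP_STARTUP_S=30   # job submission and scheduling delay in seconds (assumed)

# Estimated seconds to copy $1 GB with each tool (integer arithmetic).
cp_seconds()     { echo $(( $1 / CP_GBPS )); }
distcp_seconds() { echo $(( DISTCP_STARTUP_S + $1 / DISTCP_GBPS )); }

# Size in GB at which both estimates are equal:
#   S / cp = startup + S / distcp  =>  S = startup * cp * distcp / (distcp - cp)
breakeven_gb() {
    echo $(( DISTCP_STARTUP_S * CP_GBPS * DISTCP_GBPS / (DISTCP_GBPS - CP_GBPS) ))
}

for size in 20 400; do
    echo "$size GB: cp ~$(cp_seconds "$size") s, distcp ~$(distcp_seconds "$size") s"
done
echo "break-even at about $(breakeven_gb) GB"
```

With these assumed numbers the break-even lands at 40 GB: below it the
startup delay dominates and cp finishes first, above it the parallel
transfer rate wins. On a busy cluster the startup term could be much larger.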





> On Apr 14, 2013 6:15 AM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>>
>> Lance,
>>
>> Never say never.
>>
>> Linux programs can read from the right kind of Hadoop cluster without
>> using FUSE.
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com> wrote:
>>
>>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>>> system visible as a Unix mounted file system. Otherwise, Unix programs
>>> cannot read or write HDFS files.
>>>
>>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>>
>>>  Summing up, what would be the recommendations for copying?
>>>
>>>  1) DistCP
>>>  2) shell cp command
>>>  3) Using the FileSystem API (FileUtil, to be precise) inside a Java
>>> program
>>>  4) An MR job with an Identity Mapper and no Reducer (maybe this is what
>>> DistCP does)
>>>
>>>
>>>  I did not run any comparisons, as my dev cluster is just a two-node
>>> cluster and I am not sure how this would perform on a production cluster.
>>>
>>>  Kay
>>>
>>>
>>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>>  Yes, makes sense... cp is serialized and simpler, and does not rely
>>>> on the JobTracker, whereas distcp only submits a job and waits for
>>>> completion, so it can fail if tasks start to fail or time out.
>>>>  I have seen distcp fail and hang before, albeit not often.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>>>> wrote:
>>>>
>>>>   If the cluster is busy with other jobs, distcp will wait for free map
>>>> slots. Regular cp is more reliable and predictable, especially if you
>>>> need to copy just several GB.
>>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>>>
>>>>>  The cp command is not parallel; it just calls FileSystem, even though
>>>>> DFSClient is multi-threaded.
>>>>>
>>>>>  DistCp can work well on the same cluster.
>>>>>
>>>>>
>>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>>>
>>>>>>  The File System Copy utility copies files byte by byte, if I'm not
>>>>>> wrong. Could it be possible that the cp command works with blocks and
>>>>>> moves them, which could be significantly more efficient?
>>>>>>
>>>>>>
>>>>>>  Also, how does the cp command work if the file is distributed across
>>>>>> different data nodes?
>>>>>>
>>>>>>  Thanks
>>>>>>  Kay
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>>>
>>>>>>>  DistCP is a full-blown MapReduce job (mapper only, where the
>>>>>>> mappers do a "fully" parallel copy to the destination).
>>>>>>>
>>>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>>>> and issue a copy command for every source file.
>>>>>>>
>>>>>>>  I have an additional question: how is CP, which is internal to a
>>>>>>> cluster, optimized (if at all)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>>>
>>>>>>>>  Hi,
>>>>>>>>
>>>>>>>> I think it's better to use Copy within the same cluster and distCP
>>>>>>>> between clusters. The cp command is a Hadoop-internal parallel
>>>>>>>> process and will not copy files locally.
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>  麦树荣
>>>>>>>>
>>>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>>> *To:* user@hadoop.apache.org
>>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>>>        I have a few questions regarding the usage of DistCP for
>>>>>>>> copying files in the same cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Which one is better within the same cluster, and what factors
>>>>>>>> (like file size etc.) would influence the usage of one over the other?
>>>>>>>>
>>>>>>>>  2) When we run a cp command like the one below from a client node of
>>>>>>>> the cluster (not a data node), how does the cp command work?
>>>>>>>>       i) like an MR job
>>>>>>>>      ii) copy the files locally and then copy them back to the new
>>>>>>>> location.
>>>>>>>>
>>>>>>>>  Example of the copy command
>>>>>>>>
>>>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>>
>>>>>>>>  Thanks, your responses are appreciated.
>>>>>>>>
>>>>>>>>  -- Kay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Jay Vyas
>>>>>>> http://jayunit100.blogspot.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Inline


On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> That was a hidden shameless plug Ted ;-)
>

Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.


> The main disadvantage of fs -cp is that all data has to transit via the
> machine you issue the command on, depending on the size of data you want to
> copy that can be a killer. DistCp is distributed as its name imply, so no
> bottleneck of this kind then.
>

This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand cp dominates distcp for convenience.

In my own experience, I love cp when copying relatively small amounts of
data (10's of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less time that it takes distcp to get started.

At larger sizes (100's of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.





> On Apr 14, 2013 6:15 AM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>>
>> Lance,
>>
>> Never say never.
>>
>> Linux programs can read from the right kind of Hadoop cluster without
>> using FUSE.
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com>wrote:
>>
>>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>>> system visible as a Unix mounted file system. Otherwise, Unix programs
>>> cannot read or write HDFS files.
>>>
>>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>>
>>>    Summing up what would be the recommendations for copy
>>>
>>>  1) DistCP
>>>  2) shell cp command
>>>  3) Using File System API(FileUtils to be precise) inside of a Java
>>> program
>>>  4) A MR with an Identity Mapper and no Reducer (may be this is what
>>> DistCP does)
>>>
>>>
>>>  I did not run any comparisons as my dev cluster is just a two node
>>> cluster and not sure how this would perform on a production cluster.
>>>
>>>  Kay
>>>
>>>
>>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>>  Yes makes sense...  cp is serialized and simpler, and does not rely
>>>> on jobtracker- Whereas distcp actually only submits a job and waits for
>>>> completion.
>>>> So it can fail if tasks start to fail or timeout.
>>>>  I Have seen distcp fail and hang before albeit not often.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>>>> wrote:
>>>>
>>>>   if cluster is busy with other jobs distcp will wait for free map
>>>> slots. Regular cp is more reliable and predictable. Especialy if you need
>>>> to copy just several GB
>>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>>>
>>>>>  CP command is not parallel, It's just call FileSystem, even if
>>>>> DFSClient has multi threads.
>>>>>
>>>>>  DistCp can work well on the same cluster.
>>>>>
>>>>>
>>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com>wrote:
>>>>>
>>>>>>  The File System Copy utility copies files byte by byte if I'm not
>>>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>>>> them which could be significantly efficient?
>>>>>>
>>>>>>
>>>>>>  Also how does the cp command work if the file is distributed on
>>>>>> different data nodes??
>>>>>>
>>>>>>  Thanks
>>>>>>  Kay
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com>wrote:
>>>>>>
>>>>>>>  DistCP is a full blown mapreduce job (mapper only, where the
>>>>>>> mappers do a "fully" parallel copy to the detsination).
>>>>>>>
>>>>>>> CP appears (correct me if im wrong) to simply invoke the FileSystem
>>>>>>> and issues a copy command for every source file.
>>>>>>>
>>>>>>>  I have an additional question: how is CP which is internal to a
>>>>>>> cluster optimized (if at all) ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>>>
>>>>>>>>  Hi，
>>>>>>>>
>>>>>>>> I think it' better using Copy in the same cluster while using
>>>>>>>> distCP between clusters, and cp command is a hadoop internal parallel
>>>>>>>> process and will not copy files locally.
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>  麦树荣
>>>>>>>>
>>>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>>> *To:* user@hadoop.apache.org
>>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>>>        I have few questions regarding the usage of DistCP for
>>>>>>>> copying files in the same cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Which one is better within a  same cluster and what factors
>>>>>>>> (like file size etc) wouldinfluence the usage of one over te other?
>>>>>>>>
>>>>>>>>  2) when we run a cp command like below from a  client node of the
>>>>>>>> cluster (not a data node), How does the cp command work
>>>>>>>>       i) like an MR job
>>>>>>>>      ii) copy files locally and then it copy it back at the new
>>>>>>>> location.
>>>>>>>>
>>>>>>>>  Example of the copy command
>>>>>>>>
>>>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>>
>>>>>>>>  Thanks, your responses are appreciated.
>>>>>>>>
>>>>>>>>  -- Kay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Jay Vyas
>>>>>>> http://jayunit100.blogspot.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Inline


On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> That was a hidden shameless plug Ted ;-)
>

Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.


> The main disadvantage of fs -cp is that all data has to transit via the
> machine you issue the command on, depending on the size of data you want to
> copy that can be a killer. DistCp is distributed as its name imply, so no
> bottleneck of this kind then.
>

This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand cp dominates distcp for convenience.

In my own experience, I love cp when copying relatively small amounts of
data (10's of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less time that it takes distcp to get started.

At larger sizes (100's of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.





> On Apr 14, 2013 6:15 AM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>>
>> Lance,
>>
>> Never say never.
>>
>> Linux programs can read from the right kind of Hadoop cluster without
>> using FUSE.
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com>wrote:
>>
>>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>>> system visible as a Unix mounted file system. Otherwise, Unix programs
>>> cannot read or write HDFS files.
>>>
>>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>>
>>>    Summing up what would be the recommendations for copy
>>>
>>>  1) DistCP
>>>  2) shell cp command
>>>  3) Using File System API(FileUtils to be precise) inside of a Java
>>> program
>>>  4) A MR with an Identity Mapper and no Reducer (may be this is what
>>> DistCP does)
>>>
>>>
>>>  I did not run any comparisons as my dev cluster is just a two node
>>> cluster and not sure how this would perform on a production cluster.
>>>
>>>  Kay
>>>
>>>
>>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>>  Yes makes sense...  cp is serialized and simpler, and does not rely
>>>> on jobtracker- Whereas distcp actually only submits a job and waits for
>>>> completion.
>>>> So it can fail if tasks start to fail or timeout.
>>>>  I Have seen distcp fail and hang before albeit not often.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>>>> wrote:
>>>>
>>>>   if cluster is busy with other jobs distcp will wait for free map
>>>> slots. Regular cp is more reliable and predictable. Especialy if you need
>>>> to copy just several GB
>>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>>>
>>>>>  CP command is not parallel, It's just call FileSystem, even if
>>>>> DFSClient has multi threads.
>>>>>
>>>>>  DistCp can work well on the same cluster.
>>>>>
>>>>>
>>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com>wrote:
>>>>>
>>>>>>  The File System Copy utility copies files byte by byte if I'm not
>>>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>>>> them which could be significantly efficient?
>>>>>>
>>>>>>
>>>>>>  Also how does the cp command work if the file is distributed on
>>>>>> different data nodes??
>>>>>>
>>>>>>  Thanks
>>>>>>  Kay
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com>wrote:
>>>>>>
>>>>>>>  DistCP is a full blown mapreduce job (mapper only, where the
>>>>>>> mappers do a "fully" parallel copy to the detsination).
>>>>>>>
>>>>>>> CP appears (correct me if im wrong) to simply invoke the FileSystem
>>>>>>> and issues a copy command for every source file.
>>>>>>>
>>>>>>>  I have an additional question: how is CP which is internal to a
>>>>>>> cluster optimized (if at all) ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>>>
>>>>>>>>  Hi，
>>>>>>>>
>>>>>>>> I think it' better using Copy in the same cluster while using
>>>>>>>> distCP between clusters, and cp command is a hadoop internal parallel
>>>>>>>> process and will not copy files locally.
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>  麦树荣
>>>>>>>>
>>>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>>> *To:* user@hadoop.apache.org
>>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>>>        I have few questions regarding the usage of DistCP for
>>>>>>>> copying files in the same cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Which one is better within a  same cluster and what factors
>>>>>>>> (like file size etc) wouldinfluence the usage of one over te other?
>>>>>>>>
>>>>>>>>  2) when we run a cp command like below from a  client node of the
>>>>>>>> cluster (not a data node), How does the cp command work
>>>>>>>>       i) like an MR job
>>>>>>>>      ii) copy files locally and then it copy it back at the new
>>>>>>>> location.
>>>>>>>>
>>>>>>>>  Example of the copy command
>>>>>>>>
>>>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>>
>>>>>>>>  Thanks, your responses are appreciated.
>>>>>>>>
>>>>>>>>  -- Kay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Jay Vyas
>>>>>>> http://jayunit100.blogspot.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Inline


On Sun, Apr 14, 2013 at 1:13 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> That was a hidden shameless plug Ted ;-)
>

Well, I will admit it was a shameless correction to Lance's absolute and
incorrect claim.


> The main disadvantage of fs -cp is that all data has to transit via the
> machine you issue the command on, depending on the size of data you want to
> copy that can be a killer. DistCp is distributed as its name imply, so no
> bottleneck of this kind then.
>

This is absolutely true.  Distcp dominates cp for large copies.  On the
other hand cp dominates distcp for convenience.

In my own experience, I love cp when copying relatively small amounts of
data (10's of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less time that it takes distcp to get started.

At larger sizes (100's of GB and up), the startup time of distcp doesn't
matter because once it gets going, it moves data much faster.





> On Apr 14, 2013 6:15 AM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>>
>> Lance,
>>
>> Never say never.
>>
>> Linux programs can read from the right kind of Hadoop cluster without
>> using FUSE.
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com>wrote:
>>
>>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>>> system visible as a Unix mounted file system. Otherwise, Unix programs
>>> cannot read or write HDFS files.
>>>
>>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>>
>>>    Summing up what would be the recommendations for copy
>>>
>>>  1) DistCP
>>>  2) shell cp command
>>>  3) Using File System API(FileUtils to be precise) inside of a Java
>>> program
>>>  4) A MR with an Identity Mapper and no Reducer (may be this is what
>>> DistCP does)
>>>
>>>
>>>  I did not run any comparisons as my dev cluster is just a two node
>>> cluster and not sure how this would perform on a production cluster.
>>>
>>>  Kay
>>>
>>>
>>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>>  Yes makes sense...  cp is serialized and simpler, and does not rely
>>>> on jobtracker- Whereas distcp actually only submits a job and waits for
>>>> completion.
>>>> So it can fail if tasks start to fail or timeout.
>>>>  I Have seen distcp fail and hang before albeit not often.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>>>> wrote:
>>>>
>>>>   if cluster is busy with other jobs distcp will wait for free map
>>>> slots. Regular cp is more reliable and predictable. Especialy if you need
>>>> to copy just several GB
>>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>>>
>>>>>  CP command is not parallel, It's just call FileSystem, even if
>>>>> DFSClient has multi threads.
>>>>>
>>>>>  DistCp can work well on the same cluster.
>>>>>
>>>>>
>>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com>wrote:
>>>>>
>>>>>>  The File System Copy utility copies files byte by byte if I'm not
>>>>>> wrong. Could it be possible that the cp command works with blocks and moves
>>>>>> them which could be significantly efficient?
>>>>>>
>>>>>>
>>>>>>  Also how does the cp command work if the file is distributed on
>>>>>> different data nodes??
>>>>>>
>>>>>>  Thanks
>>>>>>  Kay
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com>wrote:
>>>>>>
>>>>>>>  DistCP is a full blown mapreduce job (mapper only, where the
>>>>>>> mappers do a "fully" parallel copy to the detsination).
>>>>>>>
>>>>>>> CP appears (correct me if im wrong) to simply invoke the FileSystem
>>>>>>> and issues a copy command for every source file.
>>>>>>>
>>>>>>>  I have an additional question: how is CP which is internal to a
>>>>>>> cluster optimized (if at all) ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>>>
>>>>>>>>  Hi,
>>>>>>>>
>>>>>>>> I think it's better to use cp within the same cluster and DistCp
>>>>>>>> between clusters. The cp command is a Hadoop-internal parallel
>>>>>>>> process and will not copy files locally.
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>  麦树荣
>>>>>>>>
>>>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>>> *To:* user@hadoop.apache.org
>>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>>>        I have a few questions regarding the usage of DistCp for
>>>>>>>> copying files in the same cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Which one is better within the same cluster, and what factors
>>>>>>>> (like file size etc.) would influence the usage of one over the other?
>>>>>>>>
>>>>>>>>  2) When we run a cp command like the one below from a client node of
>>>>>>>> the cluster (not a data node), how does the cp command work?
>>>>>>>>       i) like an MR job
>>>>>>>>      ii) copy files locally and then copy them back to the new
>>>>>>>> location.
>>>>>>>>
>>>>>>>>  Example of the copy command
>>>>>>>>
>>>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>>
>>>>>>>>  Thanks, your responses are appreciated.
>>>>>>>>
>>>>>>>>  -- Kay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Jay Vyas
>>>>>>> http://jayunit100.blogspot.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>

Re: Copy Vs DistCP

Posted by Mathias Herberts <ma...@gmail.com>.

That was a hidden shameless plug, Ted ;-)

The main disadvantage of fs -cp is that all data has to transit via the
machine you issue the command on. Depending on the size of the data you want
to copy, that can be a killer. DistCp is distributed, as its name implies, so
there is no bottleneck of this kind.
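
To make the trade-off concrete, here is a minimal sketch of a wrapper that
picks between the two commands by data size. The function name
choose_copy_cmd, the example paths, and the 10 GiB threshold are all
assumptions made up for illustration; nothing in this thread or in Hadoop
prescribes that cut-off. The function only prints the command it would
suggest, so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the copy command suggested for a
# given amount of data. THRESHOLD_BYTES is an assumed cut-off, not a Hadoop
# default; tune it to the client machine's bandwidth and cluster load.
THRESHOLD_BYTES=$((10 * 1024 * 1024 * 1024))  # 10 GiB, illustrative only

choose_copy_cmd() {
    size_bytes=$1
    src=$2
    dst=$3
    if [ "$size_bytes" -lt "$THRESHOLD_BYTES" ]; then
        # Small copy: the single client streams all the data itself and
        # no MapReduce job has to be scheduled.
        echo "hdfs dfs -cp $src $dst"
    else
        # Large copy: DistCp runs a map-only job, so the data does not
        # funnel through the machine that issued the command.
        echo "hadoop distcp $src $dst"
    fi
}

# Example paths are placeholders.
choose_copy_cmd 1048576 /data/in/file /data/out/
choose_copy_cmd 53687091200 /data/in /data/out
```

In practice the size argument would probably come from something like
hdfs dfs -du -s on the source path, and a busy cluster may justify a higher
threshold, since DistCp has to wait for free map slots.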
On Apr 14, 2013 6:15 AM, "Ted Dunning" <td...@maprtech.com> wrote:

>
> Lance,
>
> Never say never.
>
> Linux programs can read from the right kind of Hadoop cluster without
> using FUSE.
>
>
>
>
> On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com> wrote:
>
>>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file
>> system visible as a Unix mounted file system. Otherwise, Unix programs
>> cannot read or write HDFS files.
>>
>> On 04/11/2013 09:52 AM, KayVajj wrote:
>>
>>    Summing up, what would be the recommendations for copying?
>>
>>  1) DistCp
>>  2) shell cp command
>>  3) Using the FileSystem API (FileUtil, to be precise) inside a Java
>> program
>>  4) An MR job with an identity mapper and no reducer (maybe this is what
>> DistCp does)
>>
>>
>>  I did not run any comparisons, as my dev cluster is just a two-node
>> cluster and I am not sure how this would perform on a production cluster.
>>
>>  Kay
>>
>>
>> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>>  Yes, that makes sense. cp is serialized and simpler, and does not rely
>>> on the JobTracker, whereas distcp only submits a job and waits for
>>> completion, so it can fail if tasks start to fail or time out.
>>>  I have seen distcp fail and hang before, albeit not often.
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>>> wrote:
>>>
>>>   If the cluster is busy with other jobs, distcp will wait for free map
>>> slots. Regular cp is more reliable and predictable, especially if you
>>> need to copy just several GB.
>>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>>
>>>>  The cp command is not parallel; it just calls FileSystem, even though
>>>> DFSClient has multiple threads.
>>>>
>>>>  DistCp can work well on the same cluster.
>>>>
>>>>
>>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com>wrote:
>>>>
>>>>>  The File System copy utility copies files byte by byte, if I'm not
>>>>> wrong. Could the cp command instead work with blocks and move them,
>>>>> which could be significantly more efficient?
>>>>>
>>>>>
>>>>>  Also, how does the cp command work if the file is distributed across
>>>>> different data nodes?
>>>>>
>>>>>  Thanks
>>>>>  Kay
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com>wrote:
>>>>>
>>>>>>  DistCp is a full-blown MapReduce job (mapper only, where the
>>>>>> mappers do a "fully" parallel copy to the destination).
>>>>>>
>>>>>> cp appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>>> and issue a copy command for every source file.
>>>>>>
>>>>>>  I have an additional question: how is cp, which is internal to a
>>>>>> cluster, optimized (if at all)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>>
>>>>>>>  Hi,
>>>>>>>
>>>>>>> I think it's better to use cp within the same cluster and DistCp
>>>>>>> between clusters. The cp command is a Hadoop-internal parallel
>>>>>>> process and will not copy files locally.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>  麦树荣
>>>>>>>
>>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>>> *Date:* 2013-04-11 06:20
>>>>>>> *To:* user@hadoop.apache.org
>>>>>>> *Subject:* Copy Vs DistCP
>>>>>>>        I have a few questions regarding the usage of DistCp for
>>>>>>> copying files in the same cluster.
>>>>>>>
>>>>>>>
>>>>>>> 1) Which one is better within the same cluster, and what factors
>>>>>>> (like file size etc.) would influence the usage of one over the other?
>>>>>>>
>>>>>>>  2) When we run a cp command like the one below from a client node of
>>>>>>> the cluster (not a data node), how does the cp command work?
>>>>>>>       i) like an MR job
>>>>>>>      ii) copy files locally and then copy them back to the new
>>>>>>> location.
>>>>>>>
>>>>>>>  Example of the copy command
>>>>>>>
>>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>>
>>>>>>>  Thanks, your responses are appreciated.
>>>>>>>
>>>>>>>  -- Kay
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>> Jay Vyas
>>>>>> http://jayunit100.blogspot.com
>>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Lance,

Never say never.

Linux programs can read from the right kind of Hadoop cluster without using
FUSE.




On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com> wrote:

>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
> visible as a Unix mounted file system. Otherwise, Unix programs cannot read
> or write HDFS files.
>
> On 04/11/2013 09:52 AM, KayVajj wrote:
>
>    Summing up, what would be the recommendations for copying?
>
>  1) DistCp
>  2) shell cp command
>  3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
>  4) An MR job with an identity mapper and no reducer (maybe this is what
> DistCp does)
>
>
>  I did not run any comparisons, as my dev cluster is just a two-node
> cluster and I am not sure how this would perform on a production cluster.
>
>  Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>
>>  Yes, that makes sense. cp is serialized and simpler, and does not rely
>> on the JobTracker, whereas distcp only submits a job and waits for
>> completion, so it can fail if tasks start to fail or time out.
>>  I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>> wrote:
>>
>>   If the cluster is busy with other jobs, distcp will wait for free map
>> slots. Regular cp is more reliable and predictable, especially if you
>> need to copy just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>
>>>  The cp command is not parallel; it just calls FileSystem, even though
>>> DFSClient has multiple threads.
>>>
>>>  DistCp can work well on the same cluster.
>>>
>>>
>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>
>>>>  The File System copy utility copies files byte by byte, if I'm not
>>>> wrong. Could the cp command instead work with blocks and move them,
>>>> which could be significantly more efficient?
>>>>
>>>>
>>>>  Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>>  Thanks
>>>>  Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>
>>>>>  DistCp is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> cp appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>>  I have an additional question: how is cp, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>> I think it's better to use cp within the same cluster and DistCp
>>>>>> between clusters. The cp command is a Hadoop-internal parallel
>>>>>> process and will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>        I have a few questions regarding the usage of DistCp for
>>>>>> copying files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors
>>>>>> (like file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of
>>>>>> the cluster (not a data node), how does the cp command work?
>>>>>>       i) like an MR job
>>>>>>      ii) copy files locally and then copy them back to the new
>>>>>> location.
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>
>

Re: Copy Vs DistCP

Posted by Ted Dunning <td...@maprtech.com>.

Lance,

Never say never.

Linux programs can read from the right kind of Hadoop cluster without using
FUSE.




On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog <go...@gmail.com> wrote:

>  Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
> visible as a Unix mounted file system. Otherwise, Unix programs cannot read
> or write HDFS files.
>
> On 04/11/2013 09:52 AM, KayVajj wrote:
>
>    Summing up, what would be the recommendations for copy?
>
>  1) DistCP
>  2) shell cp command
>  3) Using the File System API (FileUtils to be precise) inside a Java program
>  4) An MR job with an Identity Mapper and no Reducer (maybe this is what
> DistCP does)
>
>
>  I did not run any comparisons, as my dev cluster is just a two-node
> cluster and I am not sure how this would perform on a production cluster.
>
>  Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>
>>  Yes, makes sense... cp is serialized and simpler, and does not rely on
>> the JobTracker, whereas distcp actually only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>>  I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>> wrote:
>>
>>   If the cluster is busy with other jobs, distcp will wait for free map
>> slots. Regular cp is more reliable and predictable, especially if you need
>> to copy just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>
>>>  The cp command is not parallel. It just calls FileSystem, even if
>>> DFSClient has multiple threads.
>>>
>>>  DistCp can work well on the same cluster.
>>>
>>>
>>>  On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>
>>>>  The File System Copy utility copies files byte by byte, if I'm not
>>>> wrong. Could the cp command instead work with blocks and move them,
>>>> which could be significantly more efficient?
>>>>
>>>>
>>>>  Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>>  Thanks
>>>>  Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>
>>>>>  DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>>  I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>>  On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and DistCP
>>>>>> between clusters; the cp command is a Hadoop-internal parallel process and
>>>>>> will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>        I have a few questions regarding the usage of DistCP for
>>>>>> copying files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work:
>>>>>>       i) like an MR job, or
>>>>>>      ii) by copying the files locally and then copying them back to the
>>>>>> new location?
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>
>

Re: Copy Vs DistCP

Posted by Lance Norskog <go...@gmail.com>.

Shell 'cp' only works if you use 'fuse', which makes the HDFS file 
system visible as a Unix mounted file system. Otherwise, Unix programs 
cannot read or write HDFS files.
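
To make the distinction concrete, here is a minimal Python sketch (an illustration only, not part of Hadoop). It assumes a hypothetical FUSE mount root such as /mnt/hdfs; the thread does not name one, and the helper names to_mounted_path and copy_in_hdfs are made up for this example. With a mount root, an ordinary local file copy works because the mount makes HDFS paths look like local files. Without one, the sketch does not execute anything and only returns the Hadoop shell command that would be needed instead.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def to_mounted_path(hdfs_path: str, mount_root: str) -> Path:
    """Map an absolute HDFS path onto a (hypothetical) FUSE mount point."""
    if not hdfs_path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return Path(mount_root) / hdfs_path.lstrip("/")


def copy_in_hdfs(src: str, dst: str, mount_root: Optional[str] = None) -> List[str]:
    """Copy src to dst and return the equivalent command line.

    With a mount root, the copy is a plain local file copy through the mount.
    Without one, nothing is executed: the function only returns the Hadoop
    shell command a caller would have to run instead.
    """
    if mount_root is not None:
        local_src = to_mounted_path(src, mount_root)
        local_dst = to_mounted_path(dst, mount_root)
        local_dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_src, local_dst)
        return ["cp", str(local_src), str(local_dst)]
    # No mount available: ordinary Unix tools cannot see the files,
    # so the copy has to go through the Hadoop client.
    return ["hdfs", "dfs", "-cp", src, dst]
```

Note that the command in the original question, hdfs dfs -cp, is the second branch: it goes through the Hadoop client and needs no mount.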

On 04/11/2013 09:52 AM, KayVajj wrote:
> Summing up, what would be the recommendations for copy?
>
> 1) DistCP
> 2) shell cp command
> 3) Using the File System API (FileUtils to be precise) inside a Java program
> 4) An MR job with an Identity Mapper and no Reducer (maybe this is what
> DistCP does)
>
>
> I did not run any comparisons, as my dev cluster is just a two-node
> cluster and I am not sure how this would perform on a production cluster.
>
> Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <jayunit100@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Yes, makes sense... cp is serialized and simpler, and does not
>     rely on the JobTracker, whereas distcp actually only submits a job and
>     waits for completion.
>     So it can fail if tasks start to fail or time out.
>     I have seen distcp fail and hang before, albeit not often.
>
>     Sent from my iPhone
>
>     On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov
>     <apivovarov@gmail.com <ma...@gmail.com>> wrote:
>
>>     If the cluster is busy with other jobs, distcp will wait for free map
>>     slots. Regular cp is more reliable and predictable, especially if
>>     you need to copy just several GB.
>>
>>     On Apr 10, 2013 6:31 PM, "Azuryy Yu" <azuryyyu@gmail.com
>>     <ma...@gmail.com>> wrote:
>>
>>         The cp command is not parallel. It just calls FileSystem, even
>>         if DFSClient has multiple threads.
>>
>>         DistCp can work well on the same cluster.
>>
>>
>>         On Thu, Apr 11, 2013 at 8:17 AM, KayVajj
>>         <vajjalak009@gmail.com <ma...@gmail.com>> wrote:
>>
>>             The File System Copy utility copies files byte by byte, if
>>             I'm not wrong. Could the cp command instead work with
>>             blocks and move them, which could be significantly more
>>             efficient?
>>
>>
>>             Also, how does the cp command work if the file is
>>             distributed across different data nodes?
>>
>>             Thanks
>>             Kay
>>
>>
>>             On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas
>>             <jayunit100@gmail.com <ma...@gmail.com>> wrote:
>>
>>                 DistCP is a full-blown MapReduce job (mapper only,
>>                 where the mappers do a "fully" parallel copy to the
>>                 destination).
>>
>>                 CP appears (correct me if I'm wrong) to simply invoke
>>                 the FileSystem and issue a copy command for every
>>                 source file.
>>
>>                 I have an additional question: how is CP, which is
>>                 internal to a cluster, optimized (if at all)?
>>
>>
>>
>>                 On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣
>>                 <shurong.mai@qunar.com
>>                 <ma...@qunar.com>> wrote:
>>
>>                     Hi,
>>                     I think it's better to use Copy within the same
>>                     cluster and DistCP between clusters; the cp command
>>                     is a Hadoop-internal parallel process and will not
>>                     copy files locally.
>>                     ------------------------------------------------------------------------
>>                     麦树荣
>>                     *From:* KayVajj <ma...@gmail.com>
>>                     *Date:* 2013-04-11 06:20
>>                     *To:* user@hadoop.apache.org
>>                     <ma...@hadoop.apache.org>
>>                     *Subject:* Copy Vs DistCP
>>                     I have a few questions regarding the usage of
>>                     DistCP for copying files in the same cluster.
>>
>>
>>                     1) Which one is better within the same cluster, and
>>                     what factors (like file size etc.) would influence
>>                     the usage of one over the other?
>>
>>                     2) When we run a cp command like the one below from a
>>                     client node of the cluster (not a data node), how
>>                     does the cp command work:
>>                          i) like an MR job, or
>>                         ii) by copying the files locally and then copying
>>                     them back to the new location?
>>
>>                     Example of the copy command
>>
>>                     hdfs dfs -cp /<some_location>/file /<new_location>/
>>
>>                     Thanks, your responses are appreciated.
>>
>>                     -- Kay
>>
>>
>>
>>
>>                 -- 
>>                 Jay Vyas
>>                 http://jayunit100.blogspot.com
>>
>>
>>
>

Re: Copy Vs DistCP

Posted by Azuryy Yu <az...@gmail.com>.

DistCP is preferred for your requirements.


On Fri, Apr 12, 2013 at 12:52 AM, KayVajj <va...@gmail.com> wrote:

> Summing up, what would be the recommendations for copy?
>
> 1) DistCP
> 2) shell cp command
> 3) Using the File System API (FileUtils to be precise) inside a Java program
> 4) An MR job with an Identity Mapper and no Reducer (maybe this is what DistCP
> does)
>
>
> I did not run any comparisons, as my dev cluster is just a two-node cluster
> and I am not sure how this would perform on a production cluster.
>
> Kay
>
>
> On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:
>
>> Yes, makes sense... cp is serialized and simpler, and does not rely on
>> the JobTracker, whereas distcp actually only submits a job and waits for
>> completion.
>> So it can fail if tasks start to fail or time out.
>>  I have seen distcp fail and hang before, albeit not often.
>>
>> Sent from my iPhone
>>
>> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
>> wrote:
>>
>> If the cluster is busy with other jobs, distcp will wait for free map slots.
>> Regular cp is more reliable and predictable, especially if you need to copy
>> just several GB.
>> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>>
>>> The cp command is not parallel. It just calls FileSystem, even if DFSClient
>>> has multiple threads.
>>>
>>> DistCp can work well on the same cluster.
>>>
>>>
>>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>>
>>>> The File System Copy utility copies files byte by byte, if I'm not
>>>> wrong. Could the cp command instead work with blocks and move them,
>>>> which could be significantly more efficient?
>>>>
>>>>
>>>> Also, how does the cp command work if the file is distributed across
>>>> different data nodes?
>>>>
>>>> Thanks
>>>> Kay
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>>
>>>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers
>>>>> do a "fully" parallel copy to the destination).
>>>>>
>>>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem
>>>>> and issue a copy command for every source file.
>>>>>
>>>>> I have an additional question: how is CP, which is internal to a
>>>>> cluster, optimized (if at all)?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to use Copy within the same cluster and DistCP
>>>>>> between clusters; the cp command is a Hadoop-internal parallel process and
>>>>>> will not copy files locally.
>>>>>>
>>>>>> ------------------------------
>>>>>>  麦树荣
>>>>>>
>>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>>> *Date:* 2013-04-11 06:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Copy Vs DistCP
>>>>>>       I have a few questions regarding the usage of DistCP for copying
>>>>>> files in the same cluster.
>>>>>>
>>>>>>
>>>>>> 1) Which one is better within the same cluster, and what factors (like
>>>>>> file size etc.) would influence the usage of one over the other?
>>>>>>
>>>>>>  2) When we run a cp command like the one below from a client node of the
>>>>>> cluster (not a data node), how does the cp command work:
>>>>>>       i) like an MR job, or
>>>>>>      ii) by copying the files locally and then copying them back to the
>>>>>> new location?
>>>>>>
>>>>>>  Example of the copy command
>>>>>>
>>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>>
>>>>>>  Thanks, your responses are appreciated.
>>>>>>
>>>>>>  -- Kay
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jay Vyas
>>>>> http://jayunit100.blogspot.com
>>>>>
>>>>
>>>>
>>>
>

Re: Copy Vs DistCP

Posted by KayVajj <va...@gmail.com>.

Summing up, which of these options would be recommended for copying?

1) DistCP
2) shell cp command
3) Using the FileSystem API (FileUtil, to be precise) inside a Java program
4) An MR job with an identity mapper and no reducer (maybe this is what
DistCP does)


I did not run any comparisons, as my dev cluster has just two nodes and I am
not sure how this would perform on a production cluster.

Kay


On Thu, Apr 11, 2013 at 5:44 AM, Jay Vyas <ja...@gmail.com> wrote:

> Yes makes sense...  cp is serialized and simpler, and does not rely on
> jobtracker- Whereas distcp actually only submits a job and waits for
> completion.
> So it can fail if tasks start to fail or timeout.
>  I Have seen distcp fail and hang before albeit not often.
>
> Sent from my iPhone
>
> On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com>
> wrote:
>
> if cluster is busy with other jobs distcp will wait for free map slots.
> Regular cp is more reliable and predictable. Especialy if you need to copy
> just several GB
> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>
>> CP command is not parallel, It's just call FileSystem, even if DFSClient
>> has multi threads.
>>
>> DistCp can work well on the same cluster.
>>
>>
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>
>>> The File System Copy utility copies files byte by byte if I'm not wrong.
>>> Could it be possible that the cp command works with blocks and moves them
>>> which could be significantly efficient?
>>>
>>>
>>> Also how does the cp command work if the file is distributed on
>>> different data nodes??
>>>
>>> Thanks
>>> Kay
>>>
>>>
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>> DistCP is a full blown mapreduce job (mapper only, where the mappers do
>>>> a "fully" parallel copy to the detsination).
>>>>
>>>> CP appears (correct me if im wrong) to simply invoke the FileSystem and
>>>> issues a copy command for every source file.
>>>>
>>>> I have an additional question: how is CP which is internal to a cluster
>>>> optimized (if at all) ?
>>>>
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>
>>>>> **
>>>>> Hi，
>>>>>
>>>>> I think it' better using Copy in the same cluster while using distCP
>>>>> between clusters, and cp command is a hadoop internal parallel process and
>>>>> will not copy files locally.
>>>>>
>>>>> ------------------------------
>>>>>  麦树荣
>>>>>
>>>>>  *From:* KayVajj <va...@gmail.com>
>>>>> *Date:* 2013-04-11 06:20
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Copy Vs DistCP
>>>>>       I have few questions regarding the usage of DistCP for copying
>>>>> files in the same cluster.
>>>>>
>>>>>
>>>>> 1) Which one is better within a  same cluster and what factors (like
>>>>> file size etc) wouldinfluence the usage of one over te other?
>>>>>
>>>>>  2) when we run a cp command like below from a  client node of the
>>>>> cluster (not a data node), How does the cp command work
>>>>>       i) like an MR job
>>>>>      ii) copy files locally and then it copy it back at the new
>>>>> location.
>>>>>
>>>>>  Example of the copy command
>>>>>
>>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>>
>>>>>  Thanks, your responses are appreciated.
>>>>>
>>>>>  -- Kay
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>>>
>>>
>>>
>>

Re: Copy Vs DistCP

Posted by Jay Vyas <ja...@gmail.com>.

Yes, makes sense... cp is serialized and simpler, and does not rely on the JobTracker, whereas distcp only submits a job and waits for completion.
So it can fail if tasks start to fail or time out.
I have seen distcp fail and hang before, albeit not often.
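
To make the serialized-versus-parallel contrast concrete, here is a rough local-filesystem analogy in Python. It is not Hadoop code and does not reflect how either tool is implemented; the function names copy_serial and copy_parallel are made up for illustration, and the thread pool only stands in for the map tasks that a distcp job would run.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_serial(files, dest_dir):
    """Copy files one after another, the way a single client process would."""
    return [shutil.copy(f, dest_dir) for f in files]


def copy_parallel(files, dest_dir, workers=4):
    """Copy files concurrently; each worker plays the role of one map task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps input order and re-raises the first worker error,
        # so one failed "task" fails the whole copy.
        return list(pool.map(lambda f: shutil.copy(f, dest_dir), files))
```

The serial version has nothing to schedule and fails only where the client fails; the parallel version is faster for many files but depends on every worker finishing, which is the trade-off described above.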

Sent from my iPhone

On Apr 10, 2013, at 10:37 PM, Alexander Pivovarov <ap...@gmail.com> wrote:

> if cluster is busy with other jobs distcp will wait for free map slots. Regular cp is more reliable and predictable. Especialy if you need to copy just several GB
> 
> On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:
>> CP command is not parallel, It's just call FileSystem, even if DFSClient has multi threads.
>> 
>> DistCp can work well on the same cluster.
>> 
>> 
>> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>>> The File System Copy utility copies files byte by byte if I'm not wrong. Could it be possible that the cp command works with blocks and moves them which could be significantly efficient? 
>>> 
>>> 
>>> Also how does the cp command work if the file is distributed on different data nodes??
>>> 
>>> Thanks
>>> Kay
>>> 
>>> 
>>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>>> DistCP is a full blown mapreduce job (mapper only, where the mappers do a "fully" parallel copy to the detsination).  
>>>> 
>>>> CP appears (correct me if im wrong) to simply invoke the FileSystem and issues a copy command for every source file.
>>>> 
>>>> I have an additional question: how is CP which is internal to a cluster optimized (if at all) ? 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>>> Hi，
>>>>>  
>>>>> I think it' better using Copy in the same cluster while using distCP between clusters, and cp command is a hadoop internal parallel process and will not copy files locally.
>>>>>  
>>>>> 麦树荣
>>>>>  
>>>>> From: KayVajj
>>>>> Date: 2013-04-11 06:20
>>>>> To: user@hadoop.apache.org
>>>>> Subject: Copy Vs DistCP
>>>>> I have few questions regarding the usage of DistCP for copying files in the same cluster.
>>>>> 
>>>>> 
>>>>> 1) Which one is better within a  same cluster and what factors (like file size etc) wouldinfluence the usage of one over te other?
>>>>> 
>>>>> 2) when we run a cp command like below from a  client node of the cluster (not a data node), How does the cp command work
>>>>>      i) like an MR job
>>>>>     ii) copy files locally and then it copy it back at the new location.
>>>>> 
>>>>> Example of the copy command 
>>>>> 
>>>>> hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>> 
>>>>> Thanks, your responses are appreciated.
>>>>> 
>>>>> -- Kay
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com

Re: Copy Vs DistCP

Posted by Alexander Pivovarov <ap...@gmail.com>.

If the cluster is busy with other jobs, distcp will wait for free map slots.
Regular cp is more reliable and predictable, especially if you need to copy
just several GB.
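
As a sketch of that rule of thumb, a small Python helper could pick the command from the amount of data to copy. The helper name choose_copy_command and the 10 GB cut-off are assumptions made for illustration, not Hadoop defaults; only the two command lines themselves (hdfs dfs -cp and hadoop distcp) are real.

```python
def choose_copy_command(src, dst, total_bytes, threshold_bytes=10 * 1024**3):
    """Return the argv list for a same-cluster copy.

    threshold_bytes is an illustrative cut-off, not a Hadoop setting.
    """
    if total_bytes < threshold_bytes:
        # Small copy: one client process, no MapReduce job to schedule.
        return ["hdfs", "dfs", "-cp", src, dst]
    # Large copy: map-only job, parallel but dependent on free map slots.
    return ["hadoop", "distcp", src, dst]
```

The returned list can be passed to subprocess.run on a node with the Hadoop client installed; the right threshold would depend on the cluster and its load.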
On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:

> CP command is not parallel, It's just call FileSystem, even if DFSClient
> has multi threads.
>
> DistCp can work well on the same cluster.
>
>
> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>
>> The File System Copy utility copies files byte by byte if I'm not wrong.
>> Could it be possible that the cp command works with blocks and moves them
>> which could be significantly efficient?
>>
>>
>> Also how does the cp command work if the file is distributed on different
>> data nodes??
>>
>> Thanks
>> Kay
>>
>>
>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>> DistCP is a full blown mapreduce job (mapper only, where the mappers do
>>> a "fully" parallel copy to the detsination).
>>>
>>> CP appears (correct me if im wrong) to simply invoke the FileSystem and
>>> issues a copy command for every source file.
>>>
>>> I have an additional question: how is CP which is internal to a cluster
>>> optimized (if at all) ?
>>>
>>>
>>>
>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>
>>>> **
>>>> Hi，
>>>>
>>>> I think it' better using Copy in the same cluster while using distCP
>>>> between clusters, and cp command is a hadoop internal parallel process and
>>>> will not copy files locally.
>>>>
>>>> ------------------------------
>>>>  麦树荣
>>>>
>>>>  *From:* KayVajj <va...@gmail.com>
>>>> *Date:* 2013-04-11 06:20
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Copy Vs DistCP
>>>>       I have few questions regarding the usage of DistCP for copying
>>>> files in the same cluster.
>>>>
>>>>
>>>> 1) Which one is better within a  same cluster and what factors (like
>>>> file size etc) wouldinfluence the usage of one over te other?
>>>>
>>>>  2) when we run a cp command like below from a  client node of the
>>>> cluster (not a data node), How does the cp command work
>>>>       i) like an MR job
>>>>      ii) copy files locally and then it copy it back at the new
>>>> location.
>>>>
>>>>  Example of the copy command
>>>>
>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>
>>>>  Thanks, your responses are appreciated.
>>>>
>>>>  -- Kay
>>>>
>>>
>>>
>>>
>>> --
>>> Jay Vyas
>>> http://jayunit100.blogspot.com
>>>
>>
>>
>

Re: Copy Vs DistCP

Posted by Alexander Pivovarov <ap...@gmail.com>.

if cluster is busy with other jobs distcp will wait for free map slots.
Regular cp is more reliable and predictable. Especialy if you need to copy
just several GB
On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:

> CP command is not parallel, It's just call FileSystem, even if DFSClient
> has multi threads.
>
> DistCp can work well on the same cluster.
>
>
> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>
>> The File System Copy utility copies files byte by byte if I'm not wrong.
>> Could it be possible that the cp command works with blocks and moves them
>> which could be significantly efficient?
>>
>>
>> Also how does the cp command work if the file is distributed on different
>> data nodes??
>>
>> Thanks
>> Kay
>>
>>
>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>> DistCP is a full blown mapreduce job (mapper only, where the mappers do
>>> a "fully" parallel copy to the detsination).
>>>
>>> CP appears (correct me if im wrong) to simply invoke the FileSystem and
>>> issues a copy command for every source file.
>>>
>>> I have an additional question: how is CP which is internal to a cluster
>>> optimized (if at all) ?
>>>
>>>
>>>
>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>
>>>> **
>>>> Hi，
>>>>
>>>> I think it' better using Copy in the same cluster while using distCP
>>>> between clusters, and cp command is a hadoop internal parallel process and
>>>> will not copy files locally.
>>>>
>>>> ------------------------------
>>>>  麦树荣
>>>>
>>>>  *From:* KayVajj <va...@gmail.com>
>>>> *Date:* 2013-04-11 06:20
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Copy Vs DistCP
>>>>       I have few questions regarding the usage of DistCP for copying
>>>> files in the same cluster.
>>>>
>>>>
>>>> 1) Which one is better within a  same cluster and what factors (like
>>>> file size etc) wouldinfluence the usage of one over te other?
>>>>
>>>>  2) when we run a cp command like below from a  client node of the
>>>> cluster (not a data node), How does the cp command work
>>>>       i) like an MR job
>>>>      ii) copy files locally and then it copy it back at the new
>>>> location.
>>>>
>>>>  Example of the copy command
>>>>
>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>
>>>>  Thanks, your responses are appreciated.
>>>>
>>>>  -- Kay
>>>>
>>>
>>>
>>>
>>> --
>>> Jay Vyas
>>> http://jayunit100.blogspot.com
>>>
>>
>>
>

Re: Copy Vs DistCP

Posted by Alexander Pivovarov <ap...@gmail.com>.

if cluster is busy with other jobs distcp will wait for free map slots.
Regular cp is more reliable and predictable. Especialy if you need to copy
just several GB
On Apr 10, 2013 6:31 PM, "Azuryy Yu" <az...@gmail.com> wrote:

> CP command is not parallel, It's just call FileSystem, even if DFSClient
> has multi threads.
>
> DistCp can work well on the same cluster.
>
>
> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>
>> The File System Copy utility copies files byte by byte if I'm not wrong.
>> Could it be possible that the cp command works with blocks and moves them
>> which could be significantly efficient?
>>
>>
>> Also how does the cp command work if the file is distributed on different
>> data nodes??
>>
>> Thanks
>> Kay
>>
>>
>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers do
>>> a "fully" parallel copy to the destination).
>>>
>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>> issue a copy command for every source file.
>>>
>>> I have an additional question: how is CP, which is internal to a cluster,
>>> optimized (if at all)?
>>>
>>>
>>>
>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think it's better to use Copy within the same cluster and distCP
>>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>>> will not copy files locally.
>>>>
>>>> ------------------------------
>>>>  麦树荣
>>>>
>>>>  *From:* KayVajj <va...@gmail.com>
>>>> *Date:* 2013-04-11 06:20
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Copy Vs DistCP
>>>> I have a few questions regarding the usage of DistCP for copying
>>>> files in the same cluster.
>>>>
>>>> 1) Which one is better within the same cluster, and what factors (like
>>>> file size etc.) would influence the use of one over the other?
>>>>
>>>> 2) When we run a cp command like the one below from a client node of the
>>>> cluster (not a data node), how does the cp command work?
>>>>       i) like an MR job
>>>>      ii) by copying files locally and then copying them back to the new
>>>> location
>>>>
>>>>  Example of the copy command
>>>>
>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>
>>>>  Thanks, your responses are appreciated.
>>>>
>>>>  -- Kay
>>>>
>>>
>>>
>>>
>>> --
>>> Jay Vyas
>>> http://jayunit100.blogspot.com
>>>
>>
>>
>

Re: Copy Vs DistCP

Posted by KayVajj <va...@gmail.com>.

If the CP command is not parallel, how does it work for a file partitioned across
various data nodes?


On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu <az...@gmail.com> wrote:

> The CP command is not parallel; it just calls FileSystem, even though DFSClient
> has multiple threads.
>
> DistCp can work well on the same cluster.
>
>
> On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:
>
>> The File System Copy utility copies files byte by byte, if I'm not wrong.
>> Could the cp command instead work with blocks and move them,
>> which could be significantly more efficient?
>>
>>
>> Also, how does the cp command work if the file is distributed across different
>> data nodes?
>>
>> Thanks
>> Kay
>>
>>
>> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>> DistCP is a full-blown MapReduce job (mapper only, where the mappers do
>>> a "fully" parallel copy to the destination).
>>>
>>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>>> issue a copy command for every source file.
>>>
>>> I have an additional question: how is CP, which is internal to a cluster,
>>> optimized (if at all)?
>>>
>>>
>>>
>>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think it's better to use Copy within the same cluster and distCP
>>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>>> will not copy files locally.
>>>>
>>>> ------------------------------
>>>>  麦树荣
>>>>
>>>>  *From:* KayVajj <va...@gmail.com>
>>>> *Date:* 2013-04-11 06:20
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Copy Vs DistCP
>>>> I have a few questions regarding the usage of DistCP for copying
>>>> files in the same cluster.
>>>>
>>>> 1) Which one is better within the same cluster, and what factors (like
>>>> file size etc.) would influence the use of one over the other?
>>>>
>>>> 2) When we run a cp command like the one below from a client node of the
>>>> cluster (not a data node), how does the cp command work?
>>>>       i) like an MR job
>>>>      ii) by copying files locally and then copying them back to the new
>>>> location
>>>>
>>>>  Example of the copy command
>>>>
>>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>>
>>>>  Thanks, your responses are appreciated.
>>>>
>>>>  -- Kay
>>>>
>>>
>>>
>>>
>>> --
>>> Jay Vyas
>>> http://jayunit100.blogspot.com
>>>
>>
>>
>

Re: Copy Vs DistCP

Posted by Azuryy Yu <az...@gmail.com>.

The CP command is not parallel; it just calls FileSystem, even though DFSClient
has multiple threads.

DistCp can work well on the same cluster.
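
As a rough illustration of why a file spread over several data nodes does not need a parallel copy: the client reads the file's blocks one after another, fetching each from a data node that holds it, and writes them out as a single stream. The Python sketch below only simulates that ordering. The Block class, the host names and the function name are hypothetical, and this is not Hadoop code.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int     # position of the block within the file
    datanode: str     # hypothetical host holding a replica of this block
    data: bytes


def copy_file_sequentially(blocks):
    """Read each block in file order from its data node and append it to the copy.

    Returns the copied bytes and the order in which data nodes were contacted.
    """
    copied = bytearray()
    contacted = []
    for block in sorted(blocks, key=lambda b: b.block_id):
        contacted.append(block.datanode)   # one data node at a time, no parallelism
        copied.extend(block.data)
    return bytes(copied), contacted


# Blocks of one file, stored on three different (made-up) data nodes.
source = [
    Block(1, "dn-b", b"world "),
    Block(0, "dn-a", b"hello "),
    Block(2, "dn-c", b"again"),
]
data, order = copy_file_sequentially(source)
print(data, order)
```

The point of the sketch is that the blocks living on different nodes only changes which node the single reader talks to next; it does not make the copy itself parallel.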


On Thu, Apr 11, 2013 at 8:17 AM, KayVajj <va...@gmail.com> wrote:

> The File System Copy utility copies files byte by byte, if I'm not wrong.
> Could the cp command instead work with blocks and move them,
> which could be significantly more efficient?
>
>
> Also, how does the cp command work if the file is distributed across different
> data nodes?
>
> Thanks
> Kay
>
>
> On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:
>
>> DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
>> "fully" parallel copy to the destination).
>>
>> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
>> issue a copy command for every source file.
>>
>> I have an additional question: how is CP, which is internal to a cluster,
>> optimized (if at all)?
>>
>>
>>
>> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>>
>>> Hi,
>>>
>>> I think it's better to use Copy within the same cluster and distCP
>>> between clusters. The cp command is a Hadoop-internal parallel process and
>>> will not copy files locally.
>>>
>>> ------------------------------
>>>  麦树荣
>>>
>>>  *From:* KayVajj <va...@gmail.com>
>>> *Date:* 2013-04-11 06:20
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Copy Vs DistCP
>>> I have a few questions regarding the usage of DistCP for copying
>>> files in the same cluster.
>>>
>>> 1) Which one is better within the same cluster, and what factors (like
>>> file size etc.) would influence the use of one over the other?
>>>
>>> 2) When we run a cp command like the one below from a client node of the
>>> cluster (not a data node), how does the cp command work?
>>>       i) like an MR job
>>>      ii) by copying files locally and then copying them back to the new location
>>>
>>>  Example of the copy command
>>>
>>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>>
>>>  Thanks, your responses are appreciated.
>>>
>>>  -- Kay
>>>
>>
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>
>

Re: Copy Vs DistCP

Posted by KayVajj <va...@gmail.com>.

The File System Copy utility copies files byte by byte, if I'm not wrong.
Could the cp command instead work with blocks and move them, which could be
significantly more efficient?


Also, how does the cp command work if the file is distributed across different
data nodes?
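
For reference, here is a minimal Python sketch of a buffered stream copy, which is roughly what a client-side copy through a file system API amounts to as far as I know: data is read from the source in fixed-size chunks and written to a new destination file, rather than existing blocks being moved or relinked. The copy_stream name and the 4096-byte buffer are assumptions for illustration only.

```python
import io


def copy_stream(src, dst, buffer_size=4096):
    """Copy src to dst in buffer_size chunks.

    Returns (bytes_copied, number_of_non_empty_reads).
    """
    total = 0
    reads = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:          # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
        reads += 1
    return total, reads


# In-memory streams stand in for the source and destination files.
src = io.BytesIO(b"x" * 10000)
dst = io.BytesIO()
copied, reads = copy_stream(src, dst)
print(copied, reads)
```

So the copy is neither strictly byte by byte nor block by block: it moves a buffer at a time, and every byte passes through the process doing the copy.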

Thanks
Kay


On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas <ja...@gmail.com> wrote:

> DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
> "fully" parallel copy to the destination).
>
> CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
> issue a copy command for every source file.
>
> I have an additional question: how is CP, which is internal to a cluster,
> optimized (if at all)?
>
>
>
> On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:
>
>> Hi,
>>
>> I think it's better to use Copy within the same cluster and distCP
>> between clusters. The cp command is a Hadoop-internal parallel process and
>> will not copy files locally.
>>
>> ------------------------------
>>  麦树荣
>>
>>  *From:* KayVajj <va...@gmail.com>
>> *Date:* 2013-04-11 06:20
>> *To:* user@hadoop.apache.org
>> *Subject:* Copy Vs DistCP
>> I have a few questions regarding the usage of DistCP for copying
>> files in the same cluster.
>>
>> 1) Which one is better within the same cluster, and what factors (like file
>> size etc.) would influence the use of one over the other?
>>
>> 2) When we run a cp command like the one below from a client node of the
>> cluster (not a data node), how does the cp command work?
>>       i) like an MR job
>>      ii) by copying files locally and then copying them back to the new location
>>
>>  Example of the copy command
>>
>>  hdfs dfs -cp /<some_location>/file /<new_location>/
>>
>>  Thanks, your responses are appreciated.
>>
>>  -- Kay
>>
>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>

Re: Copy Vs DistCP

Posted by Jay Vyas <ja...@gmail.com>.

DistCP is a full-blown MapReduce job (mapper only, where the mappers do a
"fully" parallel copy to the destination).

CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
issue a copy command for every source file.

I have an additional question: how is CP, which is internal to a cluster,
optimized (if at all)?
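
To illustrate how a mapper-only job can spread a copy over the cluster, here is a simplified Python sketch that assigns whole files to map tasks so that each task ends up with a similar number of bytes. It is a stand-in written for this explanation, not DistCp's actual splitting code. The function name and the file sizes are made up.

```python
import heapq


def assign_files_to_maps(file_sizes, num_maps):
    """Greedy split: hand each file (largest first) to the least-loaded map task."""
    heap = [(0, i) for i in range(num_maps)]   # (bytes assigned so far, map index)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(num_maps)}
    # Sort by size descending, then by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment


files = {"a": 400, "b": 300, "c": 200, "d": 100}   # hypothetical sizes in MB
plan = assign_files_to_maps(files, 2)
print(plan)
```

Each map task then copies only its own list, which is where the parallelism comes from. A single plain cp, by contrast, walks the source files one after another in one process.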



On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:

> Hi,
>
> I think it's better to use Copy within the same cluster and distCP
> between clusters. The cp command is a Hadoop-internal parallel process and
> will not copy files locally.
>
> ------------------------------
>  麦树荣
>
>  *From:* KayVajj <va...@gmail.com>
> *Date:* 2013-04-11 06:20
> *To:* user@hadoop.apache.org
> *Subject:* Copy Vs DistCP
> I have a few questions regarding the usage of DistCP for copying
> files in the same cluster.
>
> 1) Which one is better within the same cluster, and what factors (like file
> size etc.) would influence the use of one over the other?
>
> 2) When we run a cp command like the one below from a client node of the cluster
> (not a data node), how does the cp command work?
>       i) like an MR job
>      ii) by copying files locally and then copying them back to the new location
>
>  Example of the copy command
>
>  hdfs dfs -cp /<some_location>/file /<new_location>/
>
>  Thanks, your responses are appreciated.
>
>  -- Kay
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Copy Vs DistCP

Posted by Jay Vyas <ja...@gmail.com>.

DistCP is a full blown mapreduce job (mapper only, where the mappers do a
"fully" parallel copy to the detsination).

CP appears (correct me if im wrong) to simply invoke the FileSystem and
issues a copy command for every source file.

I have an additional question: how is CP which is internal to a cluster
optimized (if at all) ?



On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:

> **
> Hi，
>
> I think it' better using Copy in the same cluster while using distCP
> between clusters, and cp command is a hadoop internal parallel process and
> will not copy files locally.
>
> ------------------------------
>  麦树荣
>
>  *From:* KayVajj <va...@gmail.com>
> *Date:* 2013-04-11 06:20
> *To:* user@hadoop.apache.org
> *Subject:* Copy Vs DistCP
>       I have few questions regarding the usage of DistCP for copying
> files in the same cluster.
>
>
> 1) Which one is better within a  same cluster and what factors (like file
> size etc) wouldinfluence the usage of one over te other?
>
>  2) when we run a cp command like below from a  client node of the cluster
> (not a data node), How does the cp command work
>       i) like an MR job
>      ii) copy files locally and then it copy it back at the new location.
>
>  Example of the copy command
>
>  hdfs dfs -cp /<some_location>/file /<new_location>/
>
>  Thanks, your responses are appreciated.
>
>  -- Kay
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Copy Vs DistCP

Posted by Jay Vyas <ja...@gmail.com>.

DistCP is a full blown mapreduce job (mapper only, where the mappers do a
"fully" parallel copy to the detsination).

CP appears (correct me if im wrong) to simply invoke the FileSystem and
issues a copy command for every source file.

I have an additional question: how is CP which is internal to a cluster
optimized (if at all) ?



On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 <sh...@qunar.com> wrote:

> **
> Hi，
>
> I think it' better using Copy in the same cluster while using distCP
> between clusters, and cp command is a hadoop internal parallel process and
> will not copy files locally.
>
> ------------------------------
>  麦树荣
>
>  *From:* KayVajj <va...@gmail.com>
> *Date:* 2013-04-11 06:20
> *To:* user@hadoop.apache.org
> *Subject:* Copy Vs DistCP
>       I have few questions regarding the usage of DistCP for copying
> files in the same cluster.
>
>
> 1) Which one is better within a  same cluster and what factors (like file
> size etc) wouldinfluence the usage of one over te other?
>
>  2) when we run a cp command like below from a  client node of the cluster
> (not a data node), How does the cp command work
>       i) like an MR job
>      ii) copy files locally and then it copy it back at the new location.
>
>  Example of the copy command
>
>  hdfs dfs -cp /<some_location>/file /<new_location>/
>
>  Thanks, your responses are appreciated.
>
>  -- Kay
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Copy Vs DistCP

Posted by 麦树荣 <sh...@qunar.com>.

Hi,

I think it's better to use Copy within the same cluster and distCP between clusters. The cp command is a Hadoop-internal parallel process and will not copy files locally.

________________________________
麦树荣

From: KayVajj <ma...@gmail.com>
Date: 2013-04-11 06:20
To: user@hadoop.apache.org
Subject: Copy Vs DistCP

I have a few questions regarding the usage of DistCP for copying files in the same cluster.

1) Which one is better within the same cluster, and what factors (like file size etc.) would influence the use of one over the other?

2) When we run a cp command like the one below from a client node of the cluster (not a data node), how does the cp command work:
     i) like an MR job, or
    ii) by copying files locally and then copying them back to the new location?

Example of the copy command

hdfs dfs -cp /<some_location>/file /<new_location>/

Thanks, your responses are appreciated.

-- Kay
