Posted to common-user@hadoop.apache.org by John Meza <j_...@hotmail.com> on 2013/03/09 19:07:56 UTC

copytolocal vs distcp

I need suggestions on the best way to copy a lot of data (~6 TB) from a cluster (20 datanodes) to the local file system.
While distcp should have much more throughput than copytolocal (I think) because it uses MR jobs, it doesn't seem to work well with the following syntax: <desturl> = "file://fs4/outdir/"
Problem: it puts the output in the Linux user's home dir. To get this to work, do I need to redefine the user's home dir to be the output dir (a LUN with lots of disk space)?
copytolocal is straightforward to use, but lacks the throughput (I think).
Suggestions? Advice?
Thanks,
John
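
For reference, the two approaches being compared look roughly like the commands built below. This is only a sketch: it prints the command lines and does not run them. The source path /data/export is hypothetical, since the thread never names it, and the helper function names are made up for illustration. The distcp destination is written with the three-slash file:/// form for an absolute local path.

```python
import shlex


def copy_to_local_cmd(src, local_dir):
    """Single-client copy: one process pulls every block through this host."""
    return ["hadoop", "fs", "-copyToLocal", src, local_dir]


def distcp_cmd(src, local_dir):
    """MapReduce copy: map tasks on the cluster write to the destination URI."""
    if not local_dir.startswith("/"):
        raise ValueError("local_dir must be an absolute path")
    # "file://" + "/fs4/outdir" gives "file:///fs4/outdir" (empty authority).
    return ["hadoop", "distcp", src, "file://" + local_dir]


if __name__ == "__main__":
    src = "/data/export"   # hypothetical HDFS source path
    dst = "/fs4/outdir"    # destination directory from the message above
    print(shlex.join(copy_to_local_cmd(src, dst)))
    print(shlex.join(distcp_cmd(src, dst)))
```

With distcp the destination is written by map tasks on the cluster nodes, so the path has to mean the same thing on every node that runs a task.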

RE: copytolocal vs distcp

Posted by John Meza <j_...@hotmail.com>.
The file:///fs4/outdir solved the output location issue. Dhaval Shah made the same suggestion. That's good. But I'm getting Map exceptions now. Given your comment about conventional NAS, this all may be for naught. Let me describe my -planned- workflow:
- export data from hdfs to local-dir (which is a directory on a lun off my Netapp filer)
- copy to portable disk array, send to cloud provider
- import to hdfs
Q: Do all Maps output to local dirs on each datanode?
Q: 20 dns writing to the same lun will have multiple issues:
  - possible directory naming collisions?
  - bottleneck at controller on filer? I think yes.
Q: Should I just start using copytolocal now? Hopefully it will complete by Monday am.
Thanks,
John
From: tdunning@maprtech.com
Date: Sat, 9 Mar 2013 14:00:52 -0500
Subject: Re: copytolocal vs distcp
To: user@hadoop.apache.org

Try file:///fs4/outdir
Symbolic links can also help.
Note that this file system has to be visible with the same path on all hosts.  You may also be bandwidth limited by whatever is serving that file system.

There are cases where you won't be limited by the file system.  MapR, for instance, has a completely distributed NFS server and specialized file systems like lustre might also have distributed network traffic. If you are just writing to a conventional NAS, however, this is unlikely to win much relative to copytolocal simply due to bottlenecking.

On Sat, Mar 9, 2013 at 1:07 PM, John Meza <j_...@hotmail.com> wrote:

I need suggestions on best methods of copying  alot of data (~6Tb) from a cluster (20-dn) to the local file system.
While distcp has much more throughput compared to copytolocal (I think) because it uses MR jobs,  it doesn't seem to work well with the following syntax   <desturl> =   "file://fs4/outdir/"
Problem: It puts in the home dir for the linux user. To get this to work I need to redefine the users home dir to the output dir (lun) with lotsa disk space.?
copytolocal is straightforward to use, but lacks the throughput (I think).
Suggestions? Advice?
thanks
John

Re: copytolocal vs distcp

Posted by Ted Dunning <td...@maprtech.com>.
Try file:///fs4/outdir
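
The difference between the two spellings can be seen with a generic URI parser. The sketch below uses Python's standard urllib.parse rather than Hadoop's own path handling (an assumption made purely for illustration, and the helper name is made up): with two slashes, fs4 is read as the authority (host) part and the path is only /outdir/; with three slashes, the authority is empty and all of /fs4/outdir is the path.

```python
from urllib.parse import urlsplit


def authority_and_path(uri):
    """Return the (authority, path) pair a generic URI parser sees."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


if __name__ == "__main__":
    for uri in ("file://fs4/outdir/", "file:///fs4/outdir"):
        authority, path = authority_and_path(uri)
        print(f"{uri}: authority={authority!r} path={path!r}")
```

This only shows how the URI splits. It does not show what Hadoop then does with a non-empty authority on a file: URI.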

Symbolic links can also help.

Note that this file system has to be visible with the same path on all
hosts.  You may also be bandwidth limited by whatever is serving that file
system.

There are cases where you won't be limited by the file system.  MapR, for
instance, has a completely distributed NFS server and specialized file
systems like lustre might also have distributed network traffic. If you are
just writing to a conventional NAS, however, this is unlikely to win much
relative to copytolocal simply due to bottlenecking.
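
A back-of-the-envelope estimate makes the bottleneck point concrete: if a single NAS is the limit, the copy time is roughly the data size divided by whatever rate that one device sustains, no matter how many nodes are writing. In the sketch below the sustained write rates are assumed values for illustration, not measurements of any particular filer, and 6 TB is taken as 6 x 10^12 bytes.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a constant sustained rate."""
    if bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    size = 6 * 10**12  # ~6 TB, from the original question
    # Assumed sustained write rates at the shared destination (hypothetical).
    for mb_per_s in (50, 100, 400):
        hours = transfer_hours(size, mb_per_s * 10**6)
        print(f"{mb_per_s:>4} MB/s -> {hours:6.1f} hours")
```

At an assumed 100 MB/s, for example, 6 TB works out to roughly 16.7 hours whether one client or twenty are writing, which is why adding map tasks does not help once the destination is saturated.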




On Sat, Mar 9, 2013 at 1:07 PM, John Meza <j_...@hotmail.com> wrote:

> I need suggestions on best methods of copying  alot of data (~6Tb) from a
> cluster (20-dn) to the local file system.
>
> While *distcp *has much more throughput compared to copytolocal (I think)
> because it uses MR jobs,  it doesn't seem to work well with the following
> syntax
>    <desturl> =   "file://fs4/outdir/"
>
> Problem: It puts in the home dir for the linux user. To get this to work I
> need to redefine the users home dir to the output dir (lun) with lotsa disk
> space.?
>
> *copytolocal *is straightforward to use, but lacks the throughput (I
> think).
>
> Suggestions? Advice?
> thanks
> John
>
