Posted to user@hadoop.apache.org by Kevin <ke...@gmail.com> on 2015/02/13 14:28:47 UTC

Copying many files to HDFS

Hi,

I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
or so files into HDFS, which totals roughly 1 TB. The cluster will be
isolated on its own private LAN with a single client machine that is
connected to the Hadoop cluster as well as the public network. The data
that needs to be copied into HDFS is mounted via NFS on the client
machine.

I can run `hadoop fs -put` concurrently on the client machine to try and
increase the throughput.
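
For example, something along these lines (just a sketch; the source and
target paths and the parallelism of 8 are placeholders, and it assumes
GNU find/xargs on the client):

  # copy every file under the NFS mount into one HDFS directory,
  # eight transfers at a time, keeping the original file names
  find /mnt/nfs/data -type f -print0 | \
    xargs -0 -P 8 -I{} hadoop fs -put {} /data/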

If these files could be accessed by each node in the Hadoop cluster,
then I could write a MapReduce job to copy a number of files from the
network into HDFS. I could not find anything in the documentation saying
that `distcp` works with locally hosted files (its code in the tools
package doesn't show any sign of it either) - but I wouldn't expect it to.
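
(If every node could see the mount at the same path, the copy could in
principle be expressed as a distcp over a file:// source - a sketch with
placeholder paths, and not something I have verified on CDH5.1.3:

  hadoop distcp file:///mnt/nfs/data /data

but only the client machine can see the mount, so that option is out.)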

In general, are there any other ways of copying a very large number of
client-local files to HDFS? I searched the mail archives for a similar
question and didn't come across one. I'm sorry if this is a duplicate
question.

Thanks for your time,
Kevin

Re: Copying many files to HDFS

Posted by Alexander Pivovarov <ap...@gmail.com>.
Hi Kevin,

What is the network throughput between:
1. the NFS server and the client machine?
2. the client machine and the datanodes?
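
A quick way to get a rough number for both (just a sketch; the paths and
hostnames are placeholders, and it assumes dd and iperf3 are available):

  # NFS read speed as seen from the client
  dd if=/mnt/nfs/data/somefile of=/dev/null bs=1M count=1024

  # client-to-datanode bandwidth
  iperf3 -s              # run this on a datanode
  iperf3 -c datanode1    # run this on the client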

Alex

On Feb 13, 2015 5:29 AM, "Kevin" <ke...@gmail.com> wrote:

> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>

Re: Copying many files to HDFS

Posted by Mark Payne <ma...@hotmail.com>.
Kevin <ke...@...> writes:

> Johny, NiFi looks interesting but I can't really grasp how it will
> help me. If you could provide some example code or a more detailed
> explanation of how you set up a topology, then that would be great.



Kevin,

With NiFi you wouldn't have example code. NiFi is a dataflow automation 
tool where you construct your dataflow visually with drag-and-drop 
components. You can download it by going to nifi.incubator.apache.org 
and following the Downloads link. Once downloaded, you would untar it 
and run "bin/nifi.sh start".

At that point you could build your dataflow by navigating your browser 
to http://localhost:8080/nifi

There's actually a really good blog post on how to do essentially what 
you're looking to do at 
http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/

The idea is that the dataflow pulls in any data from a local or network 
drive into NiFi, deletes the file, and then pushes the data to HDFS. I 
would caution though that in the blog post, failure to send to HDFS is 
"auto-terminated," which means that the data would be deleted. In 
reality, you should route the "failure" relationship back to PutHDFS. I 
think this would make a lot more sense after you read the blog :)

There are also a lot of video tutorials on how to use NiFi at 
https://kisstechdocs.wordpress.com/

If you've got any questions or comments, you can mail 
dev@nifi.incubator.apache.org - you should get a pretty quick response.





Re: Copying many files to HDFS

Posted by Kevin <ke...@gmail.com>.
Johny, NiFi looks interesting but I can't really grasp how it will help me.
If you could provide some example code or a more detailed explanation of how
you set up a topology, then that would be great.

On Fri, Feb 13, 2015 at 10:38 AM, johny casanova <pc...@outlook.com>
wrote:

>
>  Hi Kevin,
>
> You can try Apache nifi https://nifi.incubator.apache.org/ is a new
> application that is still in incubation but, awesome tool to use for what
> you are looking for. Ithas a processor that put data and get data from HDFS
> and send continuously without having to use the put command. Check them out
> and let me know if you need help. I use it to put to HDFS also and put high
> volumes like you mentioned.
>  ------------------------------
> Date: Fri, 13 Feb 2015 09:25:35 -0500
> Subject: Re: Copying many files to HDFS
> From: kevin.macksamie@gmail.com
> To: user@hadoop.apache.org
>
>
> Ahmed,
> Flume is a great tool but it doesn't cover my use case. I need to copy the
> files in their entirety and keep their file names.
>
> Alexander,
> Thanks for sharing Slurper. From the code it looks like a reasonable
> multi-threaded application to copy files. I'll keep looking at it.
>
> On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <
> wget.null@gmail.com> wrote:
>
>  Kevin,
>
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
>
> BR,
>  Alexander
>
>
>  On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
>
>  Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>
>
>
>

RE: Copying many files to HDFS

Posted by johny casanova <pc...@outlook.com>.
 Hi Kevin,

 

You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a new application that is still in incubation but an awesome tool for what you are looking for. It has processors that put data into and get data from HDFS continuously, without having to use the put command. Check it out and let me know if you need help. I also use it to put data into HDFS at high volumes like you mentioned.




Re: Copying many files to HDFS

Posted by Kevin <ke...@gmail.com>.
Ahmed,
Flume is a great tool but it doesn't cover my use case. I need to copy the
files in their entirety and keep their file names.

Alexander,
Thanks for sharing Slurper. From the code it looks like a reasonable
multi-threaded application to copy files. I'll keep looking at it.

On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wget.null@gmail.com
> wrote:

> Kevin,
>
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
>
> BR,
>  Alexander
>
>
> On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
>
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>
>
>

Re: Copying many files to HDFS

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Kevin,

Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper

BR,
 Alexander 


> On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
> 
> Hi,
> 
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, which totals roughly 1 TB. The cluster will be isolated on its own private LAN with a single client machine that is connected to the Hadoop cluster as well as the public network. The data that needs to be copied into HDFS is mounted as an NFS on the client machine.
> 
> I can run `hadoop fs -put` concurrently on the client machine to try and increase the throughput.
> 
> If these files were able to be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't tell any sign of it either) - but I wouldn't expect it to.
> 
> In general, are there any other ways of copying a very large number of client-local files to HDFS? I search the mail archives to find a similar question and I didn't come across one. I'm sorry if this is a duplicate question.
> 
> Thanks for your time,
> Kevin


Re: Copying many files to HDFS

Posted by Ahmed Ossama <ah...@aossama.com>.
Hi Kevin,

Have a look at Apache Flume. It is designed for collecting, aggregating, and moving large amounts of data into HDFS.

http://flume.apache.org/FlumeUserGuide.html

On 02/13/2015 03:28 PM, Kevin wrote:
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a 
> thousand or so files into HDFS, which totals roughly 1 TB. The cluster 
> will be isolated on its own private LAN with a single client machine 
> that is connected to the Hadoop cluster as well as the public network. 
> The data that needs to be copied into HDFS is mounted as an NFS on the 
> client machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try 
> and increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop 
> cluster, then I could write a MapReduce job to copy a number of files 
> from the network into HDFS. I could not find anything in the 
> documentation saying that `distcp` works with locally hosted files 
> (its code in the tools package doesn't tell any sign of it either) - 
> but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of 
> client-local files to HDFS? I search the mail archives to find a 
> similar question and I didn't come across one. I'm sorry if this is a 
> duplicate question.
>
> Thanks for your time,
> Kevin

-- 
Regards,
Ahmed Ossama

