Posted to user@hadoop.apache.org by Kevin <ke...@gmail.com> on 2015/02/13 14:28:47 UTC
Copying many files to HDFS
Hi,
I am setting up a Hadoop cluster (CDH 5.1.3) and I need to copy a thousand
or so files into HDFS, totaling roughly 1 TB. The cluster will be
isolated on its own private LAN, with a single client machine that is
connected to both the Hadoop cluster and the public network. The data
that needs to be copied into HDFS is mounted via NFS on the client
machine.
I can run `hadoop fs -put` concurrently on the client machine to try to
increase the throughput.
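[Editorial sketch, not from the thread: one way to run `hadoop fs -put` concurrently is a small driver that keeps a bounded pool of put processes going, one file per invocation so failures stay isolated and retryable. The paths, worker count, and the `put_all`/`cmd` names are illustrative assumptions; `cmd` is injectable so the sketch can be dry-run without a cluster.]

```python
# Upload every file under src_dir with several `hadoop fs -put`
# processes in parallel. cmd is swappable (e.g. ("cp",)) for testing
# on a machine without Hadoop installed.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def put_all(src_dir, hdfs_dst, workers=4, cmd=("hadoop", "fs", "-put")):
    """Copy each regular file under src_dir to hdfs_dst, `workers` at a time."""
    files = [p for p in Path(src_dir).rglob("*") if p.is_file()]
    def put(path):
        # One -put per file: a failed transfer only affects that file.
        subprocess.run([*cmd, str(path), hdfs_dst], check=True)
        return path.name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(put, files))
```

[Since HDFS writes from a single client are mostly network-bound, a handful of workers is usually enough to saturate the client's link.]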
If these files were accessible from each node in the Hadoop cluster,
then I could write a MapReduce job to copy a number of files from the
network into HDFS. I could not find anything in the documentation saying
that `distcp` works with locally hosted files (its code in the tools
package doesn't show any sign of it either) - but I wouldn't expect it to.
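[Editorial note, not confirmed anywhere in the thread: `distcp` addresses its source by filesystem URI, so a `file://` source can work in principle - but only if the NFS export is mounted at the identical path on every node that may run a map task, which matches the "accessible by each node" condition above. A sketch; the paths and `-m` value are made-up, and `DRYRUN` defaults to `echo` so it prints the command instead of running it (set `DRYRUN=""` on a real cluster):]

```shell
# Assumption: /mnt/nfs/data is mounted at this exact path on EVERY node.
# -m caps the number of map tasks (i.e. concurrent copiers).
DRYRUN="${DRYRUN-echo}"
$DRYRUN hadoop distcp -m 16 file:///mnt/nfs/data hdfs:///ingest/data
```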
In general, are there any other ways of copying a very large number of
client-local files to HDFS? I searched the mail archives for a similar
question but didn't come across one. I'm sorry if this is a duplicate
question.
Thanks for your time,
Kevin
Re: Copying many files to HDFS
Posted by Alexander Pivovarov <ap...@gmail.com>.
Hi Kevin,
What is the network throughput between:
1. the NFS server and the client machine?
2. the client machine and the datanodes?
Alex
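[Editorial sketch, not a thread suggestion: when `iperf` isn't available on an isolated LAN, a rough TCP throughput probe can be improvised in a few lines. Run `serve()` on one host and `measure()` on the other; the port and transfer size are illustrative assumptions.]

```python
# Crude point-to-point TCP throughput check: the server discards bytes,
# the client streams zeros and reports MB/s. Not a substitute for iperf,
# but enough to compare the NFS-server and datanode legs of the path.
import socket
import time

def serve(host="0.0.0.0", port=5001):
    """Accept one connection and discard whatever the peer sends."""
    with socket.create_server((host, port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(1 << 20):
                pass

def measure(host, port=5001, total_mb=64):
    """Send total_mb of zeros and return the observed rate in MB/s."""
    chunk = b"\0" * (1 << 20)  # 1 MiB per send
    start = time.perf_counter()
    with socket.create_connection((host, port)) as s:
        for _ in range(total_mb):
            s.sendall(chunk)
    return total_mb / (time.perf_counter() - start)
```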
On Feb 13, 2015 5:29 AM, "Kevin" <ke...@gmail.com> wrote:
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>
Re: Copying many files to HDFS
Posted by Mark Payne <ma...@hotmail.com>.
Kevin <ke...@...> writes:
>
>
> Johny, NiFi looks interesting but I can't really grasp how it will
help me. If you could provided some example code or a more detail
explanation of how you set up a topology, then that would be great.
Kevin,
With NiFi you wouldn't have example code. NiFi is a dataflow automation
tool where you construct your dataflow visually with drag-and-drop
components. You can download it by going to nifi.incubator.apache.org
and then going to the Downloads link. Once downloaded, you would untar
it and run "bin/nifi.sh start"
At that point you could build your dataflow by navigating your browser
to http://localhost:8080/nifi
There's actually a really good blog post on how to do essentially what
you're looking to do at
http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/
The idea is that the dataflow pulls in any data from a local or network
drive into NiFi, deletes the file, and then pushes the data to HDFS. I
would caution though that in the blog post, failure to send to HDFS is
"auto-terminated," which means that the data would be deleted. In
reality, you should route the "failure" relationship back to PutHDFS. I
think this would make a lot more sense after you read the blog :)
There are also a lot of video tutorials on how to use NiFi at
https://kisstechdocs.wordpress.com/
If you've got any questions or comments, you can mail
dev@nifi.incubator.apache.org - you should get a pretty quick response.
Re: Copying many files to HDFS
Posted by Kevin <ke...@gmail.com>.
Johny, NiFi looks interesting but I can't really grasp how it will help me.
If you could provide some example code or a more detailed explanation of how
you set up a topology, that would be great.
On Fri, Feb 13, 2015 at 10:38 AM, johny casanova <pc...@outlook.com>
wrote:
>
> Hi Kevin,
>
> You can try Apache nifi https://nifi.incubator.apache.org/ is a new
> application that is still in incubation but, awesome tool to use for what
> you are looking for. Ithas a processor that put data and get data from HDFS
> and send continuously without having to use the put command. Check them out
> and let me know if you need help. I use it to put to HDFS also and put high
> volumes like you mentioned.
> ------------------------------
> Date: Fri, 13 Feb 2015 09:25:35 -0500
> Subject: Re: Copying many files to HDFS
> From: kevin.macksamie@gmail.com
> To: user@hadoop.apache.org
>
>
> Ahmed,
> Flume is a great tool but it doesn't cover my use case. I need to copy the
> files in their entirety and keep their file names.
>
> Alexander,
> Thanks for sharing Slurper. From the code it looks like a reasonable
> multi-threaded application to copy files. I'll keep looking at it.
>
> On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <
> wget.null@gmail.com> wrote:
>
> Kevin,
>
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
>
> BR,
> Alexander
>
>
> On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
>
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>
>
>
>
RE: Copying many files to HDFS
Posted by johny casanova <pc...@outlook.com>.
Hi Kevin,
You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a new application that is still in incubation, but an awesome tool for what you are looking for. It has a processor that puts data into and gets data from HDFS continuously, without having to use the put command. Check it out and let me know if you need help. I also use it to put data into HDFS, at high volumes like you mentioned.
Date: Fri, 13 Feb 2015 09:25:35 -0500
Subject: Re: Copying many files to HDFS
From: kevin.macksamie@gmail.com
To: user@hadoop.apache.org
Ahmed,
Flume is a great tool but it doesn't cover my use case. I need to copy the files in their entirety and keep their file names.
Alexander,
Thanks for sharing Slurper. From the code it looks like a reasonable multi-threaded application to copy files. I'll keep looking at it.
On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wg...@gmail.com> wrote:
Kevin,
Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper
BR,
Alexander
On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
Hi,
I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, which totals roughly 1 TB. The cluster will be isolated on its own private LAN with a single client machine that is connected to the Hadoop cluster as well as the public network. The data that needs to be copied into HDFS is mounted as an NFS on the client machine.
I can run `hadoop fs -put` concurrently on the client machine to try and increase the throughput.
If these files were able to be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't tell any sign of it either) - but I wouldn't expect it to.
In general, are there any other ways of copying a very large number of client-local files to HDFS? I search the mail archives to find a similar question and I didn't come across one. I'm sorry if this is a duplicate question.
Thanks for your time,
Kevin
Re: Copying many files to HDFS
Posted by Kevin <ke...@gmail.com>.
Ahmed,
Flume is a great tool but it doesn't cover my use case. I need to copy the
files in their entirety and keep their file names.
Alexander,
Thanks for sharing Slurper. From the code it looks like a reasonable
multi-threaded application to copy files. I'll keep looking at it.
On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wget.null@gmail.com
> wrote:
> Kevin,
>
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
>
> BR,
> Alexander
>
>
> On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
>
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>
>
>
Re: Copying many files to HDFS
Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Kevin,
Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper
BR,
Alexander
> On 13 Feb 2015, at 14:28, Kevin <ke...@gmail.com> wrote:
>
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, which totals roughly 1 TB. The cluster will be isolated on its own private LAN with a single client machine that is connected to the Hadoop cluster as well as the public network. The data that needs to be copied into HDFS is mounted as an NFS on the client machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of client-local files to HDFS? I search the mail archives to find a similar question and I didn't come across one. I'm sorry if this is a duplicate question.
>
> Thanks for your time,
> Kevin
Re: Copying many files to HDFS
Posted by Ahmed Ossama <ah...@aossama.com>.
Hi Kevin,
Have a look at Apache Flume; it is designed for collecting and moving large amounts of data into HDFS.
http://flume.apache.org/FlumeUserGuide.html
On 02/13/2015 03:28 PM, Kevin wrote:
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a
> thousand or so files into HDFS, which totals roughly 1 TB. The cluster
> will be isolated on its own private LAN with a single client machine
> that is connected to the Hadoop cluster as well as the public network.
> The data that needs to be copied into HDFS is mounted as an NFS on the
> client machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try
> and increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files
> from the network into HDFS. I could not find anything in the
> documentation saying that `distcp` works with locally hosted files
> (its code in the tools package doesn't tell any sign of it either) -
> but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a
> similar question and I didn't come across one. I'm sorry if this is a
> duplicate question.
>
> Thanks for your time,
> Kevin
--
Regards,
Ahmed Ossama
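[Editor's note: a minimal Flume spooling-directory flow of the kind Ahmed suggests might look like the following sketch; the agent name, paths, and HDFS URL are hypothetical. Note that this treats files as event streams and does not preserve source file names, which is the limitation Kevin notes in his reply:]

```properties
# Hypothetical Flume agent: watch an NFS-mounted spool directory,
# write the events into HDFS via a memory channel.
agent.sources = spool
agent.channels = mem
agent.sinks = hdfs-sink

agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /mnt/nfs/incoming
agent.sources.spool.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/data/incoming
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.channel = mem
```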
Re: Copying many files to HDFS
Posted by Alexander Pivovarov <ap...@gmail.com>.
Hi Kevin,
What is the network throughput between:
1. the NFS server and the client machine?
2. the client machine and the datanodes?
Alex
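[Editor's note: those two numbers matter because the slower link bounds the total copy time. A back-of-the-envelope estimate, with an assumed (illustrative) protocol efficiency:]

```python
def copy_hours(total_bytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimated wall-clock hours to push total_bytes over a link_gbps link,
    assuming only `efficiency` of the raw line rate is achieved in practice."""
    usable_bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return total_bytes / usable_bytes_per_sec / 3600

# 1 TB over a single 1 Gb/s client link at ~70% efficiency is roughly 3 hours,
# so the slower of the NFS link and the client->datanode link dominates.
```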
On Feb 13, 2015 5:29 AM, "Kevin" <ke...@gmail.com> wrote:
> Hi,
>
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
>
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
>
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't tell any sign of it either) - but I wouldn't expect it to.
>
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I search the mail archives to find a similar
> question and I didn't come across one. I'm sorry if this is a duplicate
> question.
>
> Thanks for your time,
> Kevin
>