Posted to mapreduce-user@hadoop.apache.org by Mark Payne <ma...@hotmail.com> on 2015/03/04 23:30:39 UTC

Re: Copying many files to HDFS

Kevin <ke...@...> writes:

> Johny, NiFi looks interesting but I can't really grasp how it will 
> help me. If you could provide some example code or a more detailed 
> explanation of how you set up a topology, that would be great.



Kevin,

With NiFi there wouldn't really be any example code to share. NiFi is a 
dataflow automation tool where you construct your dataflow visually 
with drag-and-drop components. You can download it by going to 
nifi.incubator.apache.org and following the Downloads link. Once 
downloaded, untar it and run "bin/nifi.sh start".
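
For example, the whole setup is only a few shell commands (the tarball 
name below is a placeholder - use whichever version the Downloads page 
offers):

    # Unpack the NiFi release and start it.
    # nifi-x.y.z is a placeholder for the actual version you download.
    tar -xzf nifi-x.y.z-bin.tar.gz
    cd nifi-x.y.z
    bin/nifi.sh start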

At that point you could build your dataflow by navigating your browser 
to http://localhost:8080/nifi

There's actually a really good blog post on how to do essentially what 
you're looking to do:
http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/

The idea is that the dataflow pulls in any data from a local or network 
drive into NiFi, deletes the file, and then pushes the data to HDFS. I 
would caution though that in the blog post, failure to send to HDFS is 
"auto-terminated," which means that the data would be deleted. In 
reality, you should route the "failure" relationship back to PutHDFS. I 
think this would make a lot more sense after you read the blog :)
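
As a rough sketch of the flow's shape (not NiFi syntax; I'm assuming 
GetFile as the processor that pulls files in, as the blog post does, 
with PutHDFS writing them out):

    GetFile (pulls each file in, then deletes it from the source dir)
        |
        v
    PutHDFS ---success---> done (safe to auto-terminate)
        |
        +-----failure----> looped back into PutHDFS, so a failed write
                           is retried instead of the data being dropped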

There are also a lot of video tutorials on how to use NiFi at 
https://kisstechdocs.wordpress.com/

If you've got any questions or comments, you can mail 
dev@nifi.incubator.apache.org - you should get a pretty quick response.




> 
> On Fri, Feb 13, 2015 at 10:38 AM, johny casanova 
> <pcgamer2426-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> 
> 
>  Hi Kevin,
>  
> You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a 
> new application that is still in incubation, but an awesome tool for 
> what you are looking for. It has processors that put data to and get 
> data from HDFS continuously, without having to use the put command. 
> Check it out and let me know if you need help. I also use it to put 
> high volumes to HDFS like you mentioned.
> 
> Date: Fri, 13 Feb 2015 09:25:35 -0500
> Subject: Re: Copying many files to HDFS
> From: kevin.macksamie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> To: user-7ArZoLwFLBtd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org
> Ahmed,
> Flume is a great tool but it doesn't cover my use case. I need to copy 
> the files in their entirety and keep their file names.
> 
> 
> 
> Alexander,
> Thanks for sharing Slurper. From the code it looks like a reasonable 
> multi-threaded application to copy files. I'll keep looking at it.
> 
> 
> 
> On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz 
> <wget.null <at> gmail.com> wrote:
> 
> Kevin,
> 
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
> 
> BR,
>  Alexander 
> 
> 
> 
> 
> 
> On 13 Feb 2015, at 14:28, Kevin 
> <kevin.macksamie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> Hi,
> 
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a 
> thousand or so files into HDFS, which totals roughly 1 TB. The cluster 
> will be isolated on its own private LAN with a single client machine 
> that is connected to the Hadoop cluster as well as the public network. 
> The data that needs to be copied into HDFS is mounted via NFS on the 
> client machine.
> 
> I can run `hadoop fs -put` concurrently on the client machine to try 
> and increase the throughput.
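
(One simple way to run those puts in parallel from the client is xargs; 
this is just a sketch, with /mnt/nfs/data as a placeholder for the NFS 
mount and /data as a placeholder HDFS destination:)

    # Run up to 8 `hadoop fs -put` commands at once.
    # `ls | xargs` is fine for simple file names; use `find -print0`
    # with `xargs -0` if names contain spaces.
    ls /mnt/nfs/data | xargs -P 8 -I {} hadoop fs -put /mnt/nfs/data/{} /data/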
> 
> If these files were able to be accessed by each node in the Hadoop 
> cluster, then I could write a MapReduce job to copy a number of files 
> from the network into HDFS. I could not find anything in the 
> documentation saying that `distcp` works with locally hosted files 
> (its code in the tools package doesn't show any sign of it either) - 
> but I wouldn't expect it to.
> 
> In general, are there any other ways of copying a very large number of 
> client-local files to HDFS? I searched the mail archives for a similar 
> question but didn't come across one. I'm sorry if this is a duplicate 
> question.
> 
> 
> Thanks for your time,
> Kevin