Posted to common-user@hadoop.apache.org by akhil1988 <ak...@gmail.com> on 2009/06/04 20:24:33 UTC
Processing files lying in a directory structure
Hi!
I am working on applying the WordCount example to the entire Wikipedia dump. The
entire English Wikipedia is around 200 GB, which I have stored in HDFS on a
cluster to which I have access.
The problem: the Wikipedia dump contains many directories (it has a very deep
directory structure) containing HTML files, but FileInputFormat requires all
the files to be processed to be present in a single directory.
Can anybody give me an idea, or point to something that already exists, for
applying WordCount to the HTML files in these directories without changing
the directory structure?
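In case it helps clarify what I mean, the traversal I have in mind is a plain recursive walk that gathers every HTML file under the root, however deeply nested. Below is a minimal sketch in plain Java (class and method names are my own, not Hadoop API): on HDFS the analogous step would presumably use FileSystem.listStatus() on each directory and then add each collected file via FileInputFormat.addInputPath().

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CollectHtmlFiles {
    // Recursively collect every .html file under root, regardless of depth.
    static List<Path> collect(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collect(entry));            // descend into subdirectory
                } else if (entry.toString().endsWith(".html")) {
                    files.add(entry);                        // keep HTML leaf files
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in for the dump layout: nested dirs with mixed files.
        Path root = Files.createTempDirectory("wiki");
        Path sub = Files.createDirectories(root.resolve("a").resolve("b"));
        Files.createFile(root.resolve("top.html"));
        Files.createFile(sub.resolve("deep.html"));
        Files.createFile(sub.resolve("notes.txt"));          // should be skipped
        System.out.println(collect(root).size());            // prints 2
    }
}
```

The idea would then be to feed every path this returns to the job's input, instead of pointing the job at one flat directory.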
Akhil
--