Posted to common-user@hadoop.apache.org by akhil1988 <ak...@gmail.com> on 2009/06/04 20:24:33 UTC

Processing files lying in a directory structure

Hi! 

I am working on applying the WordCount example to the entire Wikipedia dump. The
entire English Wikipedia is around 200GB, which I have stored in HDFS on a
cluster to which I have access.
The problem: the Wikipedia dump contains many nested directories (it has a very
large directory structure) containing HTML files, but FileInputFormat expects
its input paths to point at the files themselves (or their immediate parent
directory) and does not by default recurse into subdirectories.

Can anybody give me an idea, or point me to something that already exists, for
applying WordCount to the HTML files in these directories without changing the
directory structure?
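In case it helps frame the question: one workaround I have been considering is to walk the tree myself and register every file as an input path. The sketch below is hypothetical and uses java.io.File only so it is self-contained; with the Hadoop API the same recursion would use FileSystem.listStatus(path) and FileStatus.isDir(), adding each file found via FileInputFormat.addInputPath(job, path). The class name InputCollector is my own invention.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: recursively collect every regular file under a root
// directory, so each one can be registered as a job input path. The Hadoop
// equivalent would walk FileStatus entries from FileSystem.listStatus()
// instead of java.io.File.
public class InputCollector {
    public static List<String> collectFiles(File root) {
        List<String> paths = new ArrayList<String>();
        File[] entries = root.listFiles();
        if (entries == null) {
            return paths; // not a directory, or unreadable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                paths.addAll(collectFiles(entry)); // recurse into subdirectory
            } else {
                paths.add(entry.getPath()); // leaf file: candidate input
            }
        }
        return paths;
    }
}
```

Alternatively, if the nesting depth is fixed, I believe input path globs (e.g. an input argument like /wiki/*/*) may be expanded by FileInputFormat, though I am not sure whether that holds for arbitrary depths.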

Akhil
-- 
View this message in context: http://www.nabble.com/Processing-files-lying-in-a-directory-structure-tp23875340p23875340.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.