Posted to dev@flink.apache.org by Timo Walther <fl...@twalthr.com> on 2014/12/15 00:41:05 UTC

Process directories containing a large number of files

Hi all,

I'm working on a small project for university and I have some questions 
about how to implement it. Maybe you could give me some hints...

I have a directory that contains around 1 million HTML files. Basically, 
I just want to read each file entirely into a String and parse it with 
JSoup in a Mapper. Do we have an InputFormat that can be used for this 
use case, or do I have to implement my own FileInputFormat for that? :/ 
In general: do you think creating InputSplits for the directory will work 
properly with 1 million FileStatus objects?
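
To make it concrete, this is roughly the pipeline I have in mind (just a
sketch: WholeFileInputFormat is a placeholder for whatever format ends up
giving me one String per file, the path is made up, and I assume the
format would extend FileInputFormat<String>):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

WholeFileInputFormat format = new WholeFileInputFormat(); // hypothetical
format.setFilePath("hdfs:///data/html");  // directory with the ~1M files

DataSet<String> pages = env.createInput(format);

DataSet<String> titles = pages.map(new MapFunction<String, String>() {
    @Override
    public String map(String html) {
        // JSoup does the actual parsing; here I would just pull the <title>
        return Jsoup.parse(html).title();
    }
});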


Regards,
Timo

Re: Process directories containing a large number of files

Posted by Stephan Ewen <se...@apache.org>.
Hey!

You can try setting the minimum split size of the file input format so
large that only one split per file gets created.
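
For example (a sketch only: I am assuming the setter is called
setMinSplitSize() and that TextInputFormat takes a Path in its
constructor, so please check this against the version you are on; the
path is a placeholder):

TextInputFormat format = new TextInputFormat(new Path("hdfs:///data/html"));
// 1 GB, i.e. larger than any single HTML file, so no file is ever
// divided further and each file ends up in exactly one split
format.setMinSplitSize(1024L * 1024L * 1024L);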

You can probably reuse the DelimitedInputFormat when you choose a
delimiter that does not occur as a character sequence in the files. But just
reading the file stream into a string builder (through a reader that
decodes the charset) is probably quite straightforward as well.
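
That second variant could look roughly like this (again only a sketch,
written against the FileInputFormat hooks as I remember them, i.e.
open(FileInputSplit), reachedEnd(), nextRecord() and the protected
'stream' field, so please double-check the signatures; UTF-8 is assumed
for the HTML files):

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;

public class WholeFileInputFormat extends FileInputFormat<String> {

    private boolean fileRead;

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split);  // opens this.stream for the file behind the split
        fileRead = false;
    }

    @Override
    public boolean reachedEnd() {
        return fileRead;    // emit exactly one record per split
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        // Decode the whole stream and collect it into one String.
        StringBuilder content = new StringBuilder();
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            content.append(buffer, 0, read);
        }
        fileRead = true;
        return content.toString();
    }
}

Combined with a very large minimum split size (or a non-splitting option
as below), every file then comes back as exactly one String.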

It may make sense to add an option to the file input format to not split up
files...
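
Until then, the format sketched above could build its own splits, one per
file (once more a sketch, using FileSystem, FileStatus, BlockLocation and
FileInputSplit from org.apache.flink.core.fs as I remember them, plus
java.util.List/ArrayList; the listing is non-recursive):

@Override
public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
    final FileSystem fs = this.filePath.getFileSystem();
    final List<FileInputSplit> splits = new ArrayList<FileInputSplit>();

    int splitNum = 0;
    for (FileStatus file : fs.listStatus(this.filePath)) {
        if (file.isDir()) {
            continue;   // only direct children; recurse here if needed
        }
        BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
        String[] hosts = blocks.length > 0 ? blocks[0].getHosts() : new String[0];
        // One split covering the entire file, so it is never broken up.
        splits.add(new FileInputSplit(splitNum++, file.getPath(), 0, file.getLen(), hosts));
    }
    return splits.toArray(new FileInputSplit[splits.size()]);
}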

Stephan