Posted to dev@flink.apache.org by Timo Walther <fl...@twalthr.com> on 2014/12/15 00:41:05 UTC
Process directories containing large number of files
Hi all,
I'm working on a small project for university and I have some questions
about how to implement it. Maybe you could give me some hints.
I have a directory that contains around 1 million HTML files. Basically,
I just want to read each file entirely into a String and parse it with
JSoup in a Mapper. Do we have an InputFormat that can be used for this
use case, or do I have to implement my own FileInputFormat for that? :/
In general: do you think creating InputSplits of the directory will work
properly with 1 million FileStatus objects?
Regards,
Timo
Re: Process directories containing large number of files
Posted by Stephan Ewen <se...@apache.org>.
Hey!
You can try setting the minimum split size of the file input format so
large that only one split per file gets created.
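The effect of a large minimum split size can be sketched with a simplified model of how a file input format might carve a file into splits. This is an illustration of the idea, not Flink's actual split-assignment code: each file is cut into chunks of at most max(minSplitSize, blockSize) bytes, so a minimum split size larger than any file yields exactly one split per file.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitModel {
    // Simplified, hypothetical model of split computation: a file is cut
    // into chunks of at most max(minSplitSize, blockSize) bytes.
    static List<Long> computeSplits(long fileSize, long minSplitSize, long blockSize) {
        long splitSize = Math.max(minSplitSize, blockSize);
        List<Long> splitLengths = new ArrayList<>();
        long offset = 0;
        while (offset < fileSize) {
            long len = Math.min(splitSize, fileSize - offset);
            splitLengths.add(len);
            offset += len;
        }
        return splitLengths;
    }

    public static void main(String[] args) {
        long fileSize = 5L * 1024 * 1024; // a 5 MiB HTML file
        // 1 MiB split size: the file is cut into 5 splits
        System.out.println(computeSplits(fileSize, 0, 1024 * 1024).size());
        // huge minimum split size: exactly one split for the whole file
        System.out.println(computeSplits(fileSize, Long.MAX_VALUE, 1024 * 1024).size());
    }
}
```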
You can probably reuse the delimited input format if you choose a
delimiter that does not occur as a character sequence in the files. But just
reading the file stream into a string builder (through a reader that
decodes the charset) is probably quite straightforward as well.
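The reader approach could look roughly like this; inside an actual InputFormat the stream would come from the opened file rather than the byte array used here for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWholeFile {
    // Reads the entire stream into a String, decoding bytes with the
    // given charset through an InputStreamReader.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
        String content = readAll(new ByteArrayInputStream(html), StandardCharsets.UTF_8);
        System.out.println(content);
    }
}
```

The resulting String can then be handed to JSoup's parser in the map function.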
It may make sense to add an option to the file input format to not split up
files...
Stephan