Posted to common-user@hadoop.apache.org by Zhengguo 'Mike' SUN <zh...@yahoo.com> on 2009/06/01 22:52:12 UTC

What is an efficient way to implement InputFormat?

Hi, All,

The input to my MapReduce job is two large text files. Each InputSplit consists of a portion of both files, and the split boundaries are content dependent, so I have to read the input files to generate the splits. The problem is that most of the job's time is spent generating these splits; the Map and Reduce phases actually take less time than that. Is there an efficient way to generate splits from files? My InputFormat class is based on FileInputFormat. FileInputFormat's getSplits function does not read the input files, but that approach is impossible for me because my splits depend on the content of the files.
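
To make the cost concrete, here is a minimal sketch of the kind of scan involved. It is an illustration only, not my actual code: the class name SplitIndexSketch, the method boundaryOffsets, and the rule "a new split starts wherever the tab-separated key at the start of a line changes" are all hypothetical stand-ins for the real content-dependent rule. It reads a stream once and collects the byte offset of every boundary. If such offsets were computed once and stored in a small side file, getSplits could perhaps read that file instead of rescanning the data on every job, but that is an assumption I have not tested.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plain Java, no Hadoop classes involved.
class SplitIndexSketch {

    // Scans the stream once and returns the byte offset of every line whose
    // key differs from the key of the previous line. The key is assumed to be
    // the text before the first tab (a made-up layout for illustration).
    public static List<Long> boundaryOffsets(InputStream in) {
        List<Long> offsets = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        String previousKey = null;
        long lineStart = 0; // byte offset where the current line begins
        long position = 0;  // number of bytes consumed so far
        try (InputStream buffered = new BufferedInputStream(in)) {
            int b;
            while ((b = buffered.read()) != -1) {
                position++;
                if (b != '\n') {
                    line.append((char) b); // one char per byte; fine for key comparison
                    continue;
                }
                previousKey = record(line, lineStart, previousKey, offsets);
                line.setLength(0);
                lineStart = position;
            }
            if (line.length() > 0) { // final line without a trailing newline
                record(line, lineStart, previousKey, offsets);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    // Adds lineStart to offsets when the line's key differs from previousKey,
    // and returns the line's key so the caller can carry it forward.
    private static String record(CharSequence line, long lineStart,
                                 String previousKey, List<Long> offsets) {
        String text = line.toString();
        int tab = text.indexOf('\t');
        String key = tab < 0 ? text : text.substring(0, tab);
        if (!key.equals(previousKey)) {
            offsets.add(lineStart);
        }
        return key;
    }
}
```

The scan itself is a single sequential pass per file; the expensive part in my case is that it runs over both large files before any map task can start.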

Any ideas or comments are appreciated.