Posted to mapreduce-user@hadoop.apache.org by Niels Basjes <ni...@basj.es> on 2010/09/08 00:02:16 UTC

Understanding FileInputFormat and isSplittable.

Hi,

Over the last few weeks we built an application using Hadoop.
Because we're working with special logfiles (line oriented, textual
and gzipped) and we wanted to extract specific fields from those
files before handing them to our mapper, we chose to implement our
own derivative of the FileInputFormat class to do this.
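
To give an idea of what that looks like, here is a simplified sketch
of such a derivative (the class name is illustrative and the actual
field extraction is left out; this is not our exact code):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class LogFileInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        // In the real format this reader is wrapped so that the specific
        // fields are extracted from each log line before it reaches the mapper.
        return new LineRecordReader();
      }

      // Note: isSplitable(JobContext, Path) is NOT overridden here, so the
      // default behaviour from FileInputFormat applies.
    }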

All went well until we tried it with "big" (100MiB and bigger) files.
We then noticed that many of the values were doubled, and when we put
in full-size production data for a test run we found "single events"
being counted as "36".
It took a while to figure out what went wrong, but essentially the
root cause was that the isSplittable method returned true for Gzipped
files (which aren't splittable). The implementation that was used was
the one in FileInputFormat. The documentation for this method states
"Is the given filename splitable? Usually, true, but if the file is
stream compressed, it will not be.". This documentation gave us the
illusion that the default FileInputFormat implementation would handle
compression correctly. Because it all worked with small Gzipped files,
we never expected the real implementation of this method to be simply
"return true;".

Because of this "true" value, the framework split the Gzipped files
anyway and ended up reading each input file in full once for every
split it had created, with really messy effects in our case.

The derived TextInputFormat class does have a compression-aware
implementation of isSplittable.
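
Its override is roughly the following (approximate, the exact code
differs per Hadoop version; newer versions also allow codecs that are
themselves splittable):

    // (uses org.apache.hadoop.io.compress.CompressionCodec and
    //  org.apache.hadoop.io.compress.CompressionCodecFactory)
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      // Only split files for which no compression codec is registered;
      // a Gzipped file therefore ends up in a single split.
      CompressionCodec codec =
          new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
      return codec == null;
    }

This is essentially what I would call the "correct" behaviour below.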

Given my current knowledge of Hadoop, I would have chosen to make the
default isSplittable implementation (i.e. the one in FileInputFormat)
either "safe" (always return false) or "correct" (return whatever is
right for the applicable compression codec). The latter would make the
implementation match what I would expect from the documentation.

I would like to understand the reasoning behind the current
implementation choice, in relation to what I expected (mainly based
on the documentation).

Thanks for explaining.

-- 
Best regards,

Niels Basjes