Posted to mapreduce-issues@hadoop.apache.org by "Niels Basjes (JIRA)" <ji...@apache.org> on 2011/05/19 16:29:47 UTC
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated MAPREDUCE-2094:
------------------------------------

    Attachment: MAPREDUCE-2094-2011-05-19.patch

I've created a patch that in my mind fixes this issue the correct way. Whether this is really the case is very open to discussion.

There are basically 4 current situations:
# People use the existing FileInputFormat derivatives. This patch ensures that all of those that are present in the existing code base (including the examples) still retain the same behavior as before.
# People have created a new derivative and HAVE overridden isSplitable with something that fits their needs. This patch does not change those situations.
# People have created a new derivative and have *NOT* overridden isSplitable. 
## If their input is in a splittable form (like LZO or uncompressed), then this patch will not affect them.
## If they have big non-splittable input files (like gzip files), they will have run into unexpected errors. They may have spent a lot of time looking for performance issues and wrong results in production that did not occur during unit testing (we did!). This patch fixes the problem without any changes to their code base.

> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2094
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>            Reporter: Niels Basjes
>         Attachments: MAPREDUCE-2094-2011-05-19.patch
>
>
> When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. 
> A file of nearly 1 GiB would be processed around 36 times in its entirety, producing garbage results and taking up a lot more CPU time than needed.
> It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply "return true;". 
> This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. " . The actual implementation effectively does "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. "
> For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does "return false;".
> Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
> # Implement something that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)
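
To make the difference between the options above concrete, here is a minimal stand-alone sketch of the reported default next to options 1 and 3. It deliberately does not use the Hadoop API: the class name, the method names and the suffix table are all hypothetical, and a real fix would presumably ask Hadoop's own codec lookup about the file instead of a hard-coded table. Option 2 (an abstract method) has no behaviour of its own to model, so it is left out.

```java
import java.util.Map;

// Hypothetical stand-alone model of the options; this is not Hadoop's API.
final class SplitPolicySketch {

    // Assumed suffix table: the value says whether a stream compressed that
    // way can be split. Gzip is the unsplittable case from the report.
    static final Map<String, Boolean> CODEC_SPLITTABLE = Map.of(".gz", Boolean.FALSE);

    // Behaviour described in the report: every file is claimed to be splittable.
    static boolean currentDefault(String fileName) {
        return true;
    }

    // Option 1: decide from the compression that the file name implies.
    static boolean codecAware(String fileName) {
        for (Map.Entry<String, Boolean> entry : CODEC_SPLITTABLE.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return true; // no known compression suffix: treat the file as uncompressed
    }

    // Option 3: the "safe" default, never split.
    static boolean safeDefault(String fileName) {
        return false;
    }
}
```

In this model only the codec-aware variant keeps uncompressed files splittable while refusing to split a gzip file; the safe default gives up splitting for every input, and the current default splits the gzip file as well.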

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira