Posted to common-issues@hadoop.apache.org by "Klaas Bosteels (JIRA)" <ji...@apache.org> on 2009/09/29 12:34:16 UTC
[jira] Commented: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
[ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760549#action_12760549 ]
Klaas Bosteels commented on HADOOP-6290:
----------------------------------------
Presumably the problem is that {{AutoInputFormat}} does not implement an equivalent of {{TextInputFormat}}'s {{isSplitable}} override, which refuses to split any file that has a compression codec:
{code}
protected boolean isSplitable(FileSystem fs, Path file) {
  // Only files without a matching compression codec may be split.
  return compressionCodecs.getCodec(file) == null;
}
{code}
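To see why a missing {{isSplitable}} check would line up with the reported numbers, here is a rough, self-contained sketch. It is not Hadoop's actual split computation: {{countSplits}} is a hypothetical helper that assumes each split is at most one DFS block long. The 110MiB file size, 64MiB block size and two input files are taken from the report below.

```java
public class SplitCountSketch {
    // Hypothetical helper, not part of Hadoop. Simplified model: a splitable
    // file gets one split per (partial) block, a non-splitable file gets one.
    static long countSplits(long fileBytes, long blockBytes, boolean splitable) {
        if (fileBytes <= 0) {
            return 0; // simplification: ignore empty files in this model
        }
        if (!splitable) {
            return 1; // whole file handed to a single map task
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long fileBytes = 110 * mib;  // compressed size from the report
        long blockBytes = 64 * mib;  // DFS block size from the report
        int files = 2;

        // Treated as splitable: 2 files x 2 splits = 4 map tasks.
        System.out.println("splitable:     "
                + files * countSplits(fileBytes, blockBytes, true) + " map tasks");
        // Treated as not splitable: 2 files x 1 split = 2 map tasks.
        System.out.println("not splitable: "
                + files * countSplits(fileBytes, blockBytes, false) + " map tasks");
    }
}
```

Under this model, two 110MiB files treated as splitable yield four map tasks, which matches the report. If each of those tasks then decompresses the stream from the start, that would also explain the doubled counts, though that part is only a guess. A file smaller than one block yields a single split either way, which would fit the observation that small files do not trigger the problem.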
> AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
> -------------------------------------------------------------------------
>
> Key: HADOOP-6290
> URL: https://issues.apache.org/jira/browse/HADOOP-6290
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.18.3
> Reporter: Erik Forsberg
>
> When running a streaming job with -inputformat org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, and an input directory containing a few .bzip2 files of roughly 110MiB each (compressed), each file is processed twice. For example, if there are two bzip2 files in the directory, four mappers are run.
> In a wordcount M/R job, the resulting counts are doubled, which indicates that each input file is analysed twice.
> This was discovered while trying out dumbo, which uses AutoInputFormat by default. See http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en
> It seems this can't be reproduced on small files. It is possible the file has to be larger than the DFS blocksize, in my case set to 64MiB.
> I'm using Cloudera's hadoop distribution, version 0.18.3-6cloudera0.3.0~intrepid.
> Please let me know if I need to provide further details.