Posted to common-issues@hadoop.apache.org by "Klaas Bosteels (JIRA)" <ji...@apache.org> on 2009/09/29 12:34:16 UTC

[jira] Commented: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file

    [ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760549#action_12760549 ] 

Klaas Bosteels commented on HADOOP-6290:
----------------------------------------

Presumably the problem is that {{AutoInputFormat}} does not implement an equivalent of {{TextInputFormat}}'s {{isSplitable}} override, which refuses to split compressed files:
{code}
protected boolean isSplitable(FileSystem fs, Path file) {
  return compressionCodecs.getCodec(file) == null;
}
{code}
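
To illustrate why a missing check like this would match the numbers in the report, here is a minimal stand-alone sketch. It is not Hadoop code: the class {{SplitSketch}}, the suffix list (a stand-in for {{compressionCodecs.getCodec(file)}}) and the ceiling-division split count (a simplification of what the real split computation does) are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names exist in Hadoop.
final class SplitSketch {
    // Assumed stand-in for a codec lookup: file suffixes treated as compressed.
    private static final List<String> COMPRESSED_SUFFIXES =
            Arrays.asList(".bz2", ".bzip2", ".gz", ".deflate");

    // Mirrors "compressionCodecs.getCodec(file) != null" in spirit only.
    static boolean hasCodec(String fileName) {
        for (String suffix : COMPRESSED_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // A file is only safe to split when no compression codec applies to it.
    static boolean isSplitable(String fileName) {
        return !hasCodec(fileName);
    }

    // Simplified split count: one split per block when splitting is allowed,
    // otherwise a single split covering the whole file.
    static long countSplits(long fileSize, long blockSize, boolean splitable) {
        if (fileSize <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (!splitable) {
            return 1;
        }
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long fileSize = 110 * mib;  // roughly the size given in the report
        long blockSize = 64 * mib;  // DFS block size given in the report

        // Without the check, the file is treated as splitable.
        long withoutCheck = countSplits(fileSize, blockSize, true);
        // With the check, a compressed file yields a single split.
        long withCheck = countSplits(fileSize, blockSize, isSplitable("input.bz2"));

        System.out.println("splits per file without check: " + withoutCheck);
        System.out.println("splits per file with check:    " + withCheck);
    }
}
```

Under these assumptions a 110MiB file with a 64MiB block size yields two splits when it is treated as splitable and one when it is not. If each split's reader then decompresses the file from the start, that would account for two mappers per file and the doubled counts, and it would also fit the observation that files smaller than the block size are not affected.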

> AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-6290
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Erik Forsberg
>
> When running a streaming job with -inputformat org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, and with an input directory containing a few .bzip2 files of roughly 110MiB each (compressed), each file is processed twice, i.e., if there are two bzip2 files in the directory, four mappers will be run.
> When running a wordcount M/R job, the resulting count is doubled, which indicates that each input file is analysed twice.
> This was discovered while trying out dumbo, which uses AutoInputFormat by default. See http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en
> It seems this can't be reproduced on small files. It is possible the file has to be larger than the DFS blocksize, in my case set to 64MiB.
> I'm using Cloudera's hadoop distribution, version 0.18.3-6cloudera0.3.0~intrepid.
> Please let me know if I need to provide further details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.