
Posted to common-issues@hadoop.apache.org by "Erik Forsberg (JIRA)" <ji...@apache.org> on 2009/09/29 11:12:16 UTC

[jira] Created: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file

AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
-------------------------------------------------------------------------

                 Key: HADOOP-6290
                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.18.3
            Reporter: Erik Forsberg


When running a streaming job whose input directory contains a few .bzip2 files, each roughly 110MiB in size (compressed), with -inputformat
org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, each file is processed twice; i.e., if there are two bzip2 files in the directory, four mappers are run.

In a wordcount M/R job, the resulting counts are doubled, which indicates that each input file is analysed twice.

This was discovered while trying out dumbo, which uses AutoInputFormat by default. See http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en

This does not seem to be reproducible with small files. Possibly the file has to be larger than the DFS block size, which in my case is set to 64MiB.

I'm using Cloudera's Hadoop distribution, version 0.18.3-6cloudera0.3.0~intrepid.

Please let me know if I need to provide further details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760549#action_12760549 ] 

Klaas Bosteels commented on HADOOP-6290:
----------------------------------------

Presumably the problem is that {{AutoInputFormat}} does not implement the equivalent of {{TextInputFormat}}'s
{code}
protected boolean isSplitable(FileSystem fs, Path file) {
  return compressionCodecs.getCodec(file) == null;
}
{code}
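
To illustrate why a missing check of this kind would match the report, here is a minimal, self-contained sketch. It is not Hadoop code: the class {{SplitSketch}}, its methods, and the suffix-based codec test are hypothetical stand-ins, and the split loop is only a simplified model of how a file-based input format carves a file into block-sized pieces. Under these assumptions, a 110MiB file with a 64MiB split size yields two splits (hence two mappers) when splittability is not checked, and one split when it is.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of split computation for a single input file.
// Hypothetical illustration only; none of these names come from Hadoop.
class SplitSketch {
    static final long MIB = 1024L * 1024L;

    // Slack factor: a tail of up to 10% over the split size is folded
    // into the last split instead of becoming a split of its own.
    static final double SPLIT_SLOP = 1.1;

    // Stand-in for "compressionCodecs.getCodec(file) == null": here a file
    // is treated as compressed if its name ends in a known suffix.
    static boolean isSplitable(String fileName) {
        return !(fileName.endsWith(".bzip2")
                || fileName.endsWith(".bz2")
                || fileName.endsWith(".gz"));
    }

    // Returns {offset, length} pairs, one per split (one mapper per split).
    static List<long[]> computeSplits(String fileName, long length,
                                      long splitSize, boolean checkSplitable) {
        List<long[]> splits = new ArrayList<>();
        if (checkSplitable && !isSplitable(fileName)) {
            // Compressed input: hand the whole file to a single mapper.
            splits.add(new long[] {0L, length});
            return splits;
        }
        long remaining = length;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits.add(new long[] {length - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long size = 110 * MIB;
        long block = 64 * MIB;
        System.out.println("splits without check: "
                + computeSplits("part-0.bzip2", size, block, false).size());
        System.out.println("splits with check:    "
                + computeSplits("part-0.bzip2", size, block, true).size());
    }
}
```

If each mapper started on a split of a compressed file ends up decompressing the file from the beginning, which is an assumption here and not something confirmed in this thread, then two splits per file would explain both the four mappers for two files and the doubled word counts. It would also fit the observation that files smaller than the block size are unaffected, since they produce a single split either way.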


[jira] Commented: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804769#action_12804769 ] 

Hudson commented on HADOOP-6290:
--------------------------------

Integrated in Hadoop-Common-trunk-Commit #147 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/147/])
    


[jira] Commented: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804715#action_12804715 ] 

Hudson commented on HADOOP-6290:
--------------------------------

Integrated in Hadoop-Common-trunk #229 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/229/])
    
