Posted to dev@crunch.apache.org by "Chao Shi (JIRA)" <ji...@apache.org> on 2013/11/25 04:39:36 UTC

[jira] [Commented] (CRUNCH-165) Pipelines should automatically use CombineFileInputFormat where input consists of many small files

    [ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831161#comment-13831161 ] 

Chao Shi commented on CRUNCH-165:
---------------------------------

Hi guys,

I ran into a problem introduced by this patch. I'm using HFileInputFormat, which overrides FileInputFormat's listStatus() to pick up input files deeper in the directory hierarchy. When I pass several input paths to HFileSource, the input format is wrapped in a CrunchCombineFileInputFormat. When CrunchCombineFileInputFormat#getSplits is called, it does not call the wrapped HFileInputFormat#listStatus; it calls FileInputFormat's instead. This behavior comes from CombineFileInputFormat. The wrapping is done by this check:

{code}
      if (format instanceof FileInputFormat && !conf.getBoolean(RuntimeParameters.DISABLE_COMBINE_FILE, false)) {
        format = new CrunchCombineFileInputFormat<Object, Object>(job);
      }
{code}

A straightforward fix is to change "format instanceof FileInputFormat" to "format.getClass() == FileInputFormat.class", but that limits the optimization to sequence files only. Does anyone have a better idea?
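
To make the failure mode concrete, here is a minimal, self-contained sketch that does not use Hadoop or Crunch at all. Every name in it (Format, BaseFormat, DeepFormat, CombiningFormat, wrap, wrapExact) is a hypothetical stand-in, and it assumes, as described above, that the combining format inherits its file listing from the base class instead of delegating to the format it replaces.

```java
import java.util.Arrays;
import java.util.List;

public class ListStatusDemo {
    /** Hypothetical stand-in for InputFormat. */
    public interface Format {
        List<String> getSplits();
    }

    /** Hypothetical stand-in for FileInputFormat: lists top-level files only. */
    public static class BaseFormat implements Format {
        protected List<String> listStatus() {
            return Arrays.asList("part-0");
        }

        @Override
        public List<String> getSplits() {
            // Splits are built from whatever listStatus() returns.
            return listStatus();
        }
    }

    /** Hypothetical stand-in for HFileInputFormat: looks deeper in the tree. */
    public static class DeepFormat extends BaseFormat {
        @Override
        protected List<String> listStatus() {
            return Arrays.asList("region/family/hfile-0");
        }
    }

    /**
     * Hypothetical stand-in for the combining format. It extends BaseFormat
     * and never consults the format it replaces, so it inherits the base
     * listStatus() (assumed behavior, per the description above).
     */
    public static class CombiningFormat extends BaseFormat {
    }

    /** Mirrors the instanceof check: every subclass gets replaced. */
    public static Format wrap(Format format, boolean disableCombine) {
        if (format instanceof BaseFormat && !disableCombine) {
            return new CombiningFormat();
        }
        return format;
    }

    /** Exact-class check: subclasses such as DeepFormat are left alone. */
    public static Format wrapExact(Format format, boolean disableCombine) {
        if (format.getClass() == BaseFormat.class && !disableCombine) {
            return new CombiningFormat();
        }
        return format;
    }

    public static void main(String[] args) {
        Format deep = new DeepFormat();
        System.out.println(deep.getSplits());                   // [region/family/hfile-0]
        System.out.println(wrap(deep, false).getSplits());      // [part-0]
        System.out.println(wrapExact(deep, false).getSplits()); // [region/family/hfile-0]
    }
}
```

In this sketch the override is lost as soon as the instanceof check replaces the format, while the exact-class check keeps it at the cost of skipping the optimization for every subclass.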

> Pipelines should automatically use CombineFileInputFormat where input consists of many small files
> --------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-165
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-165
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-165-jwills.patch, CRUNCH-165-v3.patch, CRUNCH-165-v4.patch, CRUNCH-165.patch
>
>
> Hive had a feature introduced in HIVE-74 whereby CombineFileInputFormat would be used if the input data consisted of many small files, making the resulting mapreduce jobs more efficient by giving individual mappers more data to process. This would be a nice feature for Crunch to have, too.


