Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2013/09/02 18:12:51 UTC

[jira] [Resolved] (CRUNCH-165) Pipelines should automatically use CombineFileInputFormat where input consists of many small files

     [ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills resolved CRUNCH-165.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0

Committed. That one feels pretty good.

Note that if you find that this trick hurts performance rather than improving it, you can disable it by setting the crunch.disable.combine.file property to true.
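For anyone who wants to set this in configuration, here is a sketch of the property in Hadoop's standard XML property format. It assumes the fragment lives in a file that your job's Configuration actually loads (a site file, or a resource you add yourself); only the property name comes from this issue.

```xml
<!-- Assumption: this file is loaded by the Configuration handed to the pipeline. -->
<property>
  <name>crunch.disable.combine.file</name>
  <value>true</value>
</property>
```

Calling setBoolean("crunch.disable.combine.file", true) on the Hadoop Configuration you pass to the pipeline should have the same effect.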
                
> Pipelines should automatically use CombineFileInputFormat where input consists of many small files
> --------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-165
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-165
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-165-jwills.patch, CRUNCH-165.patch, CRUNCH-165-v3.patch, CRUNCH-165-v4.patch
>
>
> HIVE-74 introduced a feature in Hive whereby CombineFileInputFormat is used when the input data consists of many small files, making the resulting MapReduce jobs more efficient by giving each mapper more data to process. This would be a nice feature for Crunch to have, too.
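To illustrate why combining helps, here is a minimal, dependency-free Java sketch of the idea: many small files are packed greedily into fewer, larger groups so that each mapper gets more data. It is not Hadoop's CombineFileInputFormat, which also takes block, node and rack locality into account. The class name SmallFileCombiner, the method combine and the size limit are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; not the Hadoop or Crunch implementation.
public class SmallFileCombiner {

    /**
     * Greedily packs file sizes (in bytes) into groups whose total stays at or
     * below maxSplitBytes. A file larger than the limit gets a group of its own.
     */
    public static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Keep the last, partially filled group.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Ten 10 MB files with a 64 MB limit per combined split.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            sizes.add(10L * 1024 * 1024);
        }
        List<List<Long>> splits = combine(sizes, 64L * 1024 * 1024);
        // Prints "10 files -> 2 combined splits"
        System.out.println(sizes.size() + " files -> " + splits.size() + " combined splits");
    }
}
```

Without combining, each of the ten files would typically get its own mapper. In this sketch, two mappers cover the same input.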

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira