Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2013/08/23 02:13:52 UTC

[jira] [Updated] (CRUNCH-256) SequentialFileNamingScheme should cache the # of files in the target directory after the first read

     [ https://issues.apache.org/jira/browse/CRUNCH-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-256:
------------------------------

    Attachment: CRUNCH-256.patch

I'd like [~gabriel.reid] to take a look at this one and double check that this won't break anything.
                
> SequentialFileNamingScheme should cache the # of files in the target directory after the first read
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-256
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-256
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-256.patch
>
>
> After a job finishes running, the post-job hooks rename the files from a temp output directory to the target output directory. When there are many files, this move can take a long time. I traced the slowdown to the fact that SequentialFileNamingScheme does a listStatus() on the output directory for every file that gets moved. If SequentialFileNamingScheme does this check only once and then increments an internal counter, the overhead of the move drops significantly.
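
The idea in the description (list the directory once, then count locally) could be sketched roughly as follows. This is not the attached patch and not the Crunch API: the class name CachedSequenceNamer, the IntSupplier standing in for the listStatus() call, and the "prefix-00000" name format are all assumptions made for illustration. It also assumes nothing else writes into the target directory while the move is in progress.

```java
import java.util.Locale;
import java.util.function.IntSupplier;

/**
 * Hypothetical sketch, not the Crunch implementation: remembers the file
 * count after the first directory listing and counts up locally afterwards.
 */
final class CachedSequenceNamer {
    // Stand-in for the expensive listStatus() call on the target directory.
    private final IntSupplier existingFileCount;

    // -1 means the directory has not been listed yet.
    private int nextSequence = -1;

    CachedSequenceNamer(IntSupplier existingFileCount) {
        this.existingFileCount = existingFileCount;
    }

    /** Returns the next sequential name; lists the directory only on the first call. */
    synchronized String nextName(String prefix) {
        if (nextSequence < 0) {
            // First call: pay for the listing once and cache the result.
            nextSequence = existingFileCount.getAsInt();
        }
        // Later calls: hand out the cached counter and advance it.
        return String.format(Locale.ROOT, "%s-%05d", prefix, nextSequence++);
    }
}
```

With a sketch like this, moving N files triggers one listing instead of N. The trade-off is that the cached counter goes stale if another process adds files to the same directory during the move, which is the kind of breakage worth double checking.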
