Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/03/05 00:28:41 UTC

[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347774#comment-14347774 ] 

Jason Lowe commented on MAPREDUCE-4815:
---------------------------------------

Took a closer look at the patch, and it looks pretty good to me.  Huge thanks to Siqi for sticking with the patch over many iterations and to Gera for doing detailed reviews!

I'm wondering about the use-case where the algorithm version changes _while_ the job is running, which seems weird to me.  That would mean the job conf changes between AM attempts, which is dubious and would have ramifications beyond just this change.  IMHO trying to describe this use case in the property description text adds unnecessary confusion, unless I'm missing how this would happen in practice and how often it would occur.

Speaking of documentation, I do think it's important to point out in the docs the incompatibilities of algorithm 2 compared to algorithm 1 as I mentioned above (e.g., partial output is more likely to be left directly underneath the output directory if the job fails badly, resolution of output path collisions between tasks is no longer deterministic, etc.).  Users will need to make sure they have mechanisms in place to verify they are not using partial output (e.g., checking for the _SUCCESS file, checking for a successful job status from the RM/JHS, etc.).
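
As an illustration of the _SUCCESS check mentioned above, here is a minimal sketch of a guard that a downstream consumer could run before reading job output.  It is not part of the patch.  The class and method names are hypothetical, and it uses java.nio.file against a locally visible directory only to stay self-contained; against HDFS the equivalent check would presumably go through FileSystem.exists(new Path(outputDir, "_SUCCESS")).  It also assumes the job was left with the _SUCCESS marker enabled.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only (not part of MAPREDUCE-4815).
final class SuccessMarkerCheck {

    // Name of the marker file written into the output directory
    // when the job commit completes successfully.
    static final String SUCCESS_MARKER = "_SUCCESS";

    private SuccessMarkerCheck() {
    }

    // Returns true only if the output directory exists and contains the
    // marker file.  A missing marker means the output may be partial
    // and should not be consumed.
    static boolean isOutputComplete(Path outputDir) {
        return Files.isDirectory(outputDir)
                && Files.isRegularFile(outputDir.resolve(SUCCESS_MARKER));
    }
}
```

A consumer would call SuccessMarkerCheck.isOutputComplete(dir) and refuse to read the directory when it returns false.  The marker only shows that the commit finished, so checking the job status from the RM/JHS is still a useful second signal.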

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v10.patch, MAPREDUCE-4815.v11.patch, MAPREDUCE-4815.v12.patch, MAPREDUCE-4815.v13.patch, MAPREDUCE-4815.v14.patch, MAPREDUCE-4815.v15.patch, MAPREDUCE-4815.v16.patch, MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the end of the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do.  The commit work was processed in parallel and overlapped the processing of outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)