Posted to mapreduce-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/05/16 12:20:00 UTC

[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537508#comment-17537508 ] 

Steve Loughran commented on MAPREDUCE-7331:
-------------------------------------------

An update on this.

* The manifest committer is in branch-3.3 and will be in the next Hadoop release off that branch.
* It's also in the Cloudera 7.2.15 runtime which shipped last week, though as a preview rather than the default in Azure and GCS deployments.

The committer should be able to coexist with other jobs running in parallel; currently you will have to disable job cleanup for that coexistence to work.

I'll be happy to review and merge a PR which restricts that temp dir cleanup to _temporary/$jobID, so as to allow multiple jobs to store their intermediate work side-by-side.
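
To make the idea concrete, here is a minimal sketch of cleanup scoped to _temporary/$jobID. It is not Hadoop code: it uses java.nio.file on the local filesystem as a stand-in for the Hadoop FileSystem API, and the class and method names (JobScopedTempDir, jobTempDir, cleanupJob) are hypothetical. It only illustrates the directory scoping; it says nothing about whether job commit itself is safe when jobs run in parallel.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper, not part of Hadoop; local filesystem used as a stand-in. */
final class JobScopedTempDir {
  static final String PENDING_DIR_NAME = "_temporary";

  /** Returns dest/_temporary/jobId, rejecting IDs that could escape that directory. */
  static Path jobTempDir(Path dest, String jobId) {
    if (jobId == null || jobId.isEmpty() || jobId.equals(".") || jobId.equals("..")
        || jobId.contains("/") || jobId.contains("\\")) {
      throw new IllegalArgumentException("invalid job ID: " + jobId);
    }
    return dest.resolve(PENDING_DIR_NAME).resolve(jobId);
  }

  /** Deletes only this job's subtree; _temporary and other jobs' directories are left alone. */
  static void cleanupJob(Path dest, String jobId) throws IOException {
    Path jobDir = jobTempDir(dest, jobId);
    if (!Files.exists(jobDir)) {
      return; // nothing to clean up
    }
    Files.walkFileTree(jobDir, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException e) throws IOException {
        if (e != null) {
          throw e;
        }
        Files.delete(dir); // directory is empty by now
        return FileVisitResult.CONTINUE;
      }
    });
  }

  public static void main(String[] args) throws IOException {
    Path dest = Files.createTempDirectory("table");
    Files.createDirectories(jobTempDir(dest, "job_0001"));
    Files.createDirectories(jobTempDir(dest, "job_0002"));
    Files.createFile(jobTempDir(dest, "job_0001").resolve("part-00000"));

    cleanupJob(dest, "job_0001");

    System.out.println("job_0001 exists: " + Files.exists(jobTempDir(dest, "job_0001")));
    System.out.println("job_0002 exists: " + Files.exists(jobTempDir(dest, "job_0002")));
  }
}
```

The sketch deliberately never deletes _temporary itself, since another job may still be using it; deciding when that parent directory can safely go is part of the analysis any real PR would need.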

However, whoever supplies a PR to do this will have to provide evidence that job commit will be safe in parallel execution, at least where the separate jobs are all writing data into the same partition tree. I would recommend reading "A Zero-Rename Committer" before trying to do so, as we really do need a rigorous analysis here. Distributed commit protocols are hard!


> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files under a table directory. The hardcoded PENDING_DIR_NAME = _temporary directory results in multiple applications using the same temporary directory. This causes unwanted results: one application interferes with another application's temporary files, and one application can end up deleting the temporary files of another. There is currently no way for applications to have their own unique path for temporary files and so avoid interference from other, totally independent applications. I think the temporary directory used by FileOutputCommitter should be made configurable, to let the caller supply its own unique value as required and avoid it getting deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> This could be used very effectively by Spark applications, even to handle stage failures where temporary directories from previous attempts cause problems, and could help in many other situations.


