Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2017/12/04 14:22:00 UTC

[jira] [Commented] (HADOOP-15087) Write directly without creating temp directory to avoid rename

    [ https://issues.apache.org/jira/browse/HADOOP-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276837#comment-16276837 ] 

Steve Loughran commented on HADOOP-15087:
-----------------------------------------

The key flaw with the existing committers is not that they are slow; it is that, in the absence of consistent file listings, rename() can miss things to copy, and so can deliver bad answers. You can't safely use the normal output committers against AWS S3 without S3Guard, though other implementations (yours too?) can behave differently.
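
To make the failure mode concrete, here is a minimal sketch of a rename-based commit running against a store whose listings can lag behind its writes. It is a toy model, not Hadoop or S3A code: the LaggyObjectStore class, the commit_by_rename function and the paths are all hypothetical names invented for the illustration, and the listing lag is forced by an explicit flag rather than by timing.

```python
class LaggyObjectStore:
    """Toy object store: PUT/GET are consistent, LIST may omit new keys."""

    def __init__(self):
        self.objects = {}      # key -> bytes, the true contents of the store
        self.unlisted = set()  # keys that exist but that LIST does not show yet

    def put(self, key, data, visible=True):
        self.objects[key] = data
        if not visible:
            self.unlisted.add(key)

    def list(self, prefix):
        # Only returns keys the (possibly stale) listing knows about.
        return sorted(k for k in self.objects
                      if k.startswith(prefix) and k not in self.unlisted)

    def rename_dir(self, src, dst):
        """Mimic a directory rename on an object store: list, copy, delete."""
        for key in self.list(src):
            self.objects[dst + key[len(src):]] = self.objects.pop(key)


def commit_by_rename(store, attempt_dir, dest_dir):
    """Hypothetical stand-in for a committer that promotes output by rename."""
    store.rename_dir(attempt_dir, dest_dir)
    return store.list(dest_dir)


store = LaggyObjectStore()
store.put("out/_temporary/attempt_0/part-0000", b"a")
# This file was written successfully, but the listing has not caught up.
store.put("out/_temporary/attempt_0/part-0001", b"b", visible=False)

committed = commit_by_rename(store, "out/_temporary/attempt_0/", "out/")
print(committed)
```

The second part file is left behind under the attempt directory and never reaches the destination, yet nothing in the commit reports an error. The job appears to succeed with incomplete output, which is a correctness problem rather than a performance one.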

Have you played with the S3A committers? I think we can outdo stocator with better failure semantics, though I've got to benchmark it properly. Why don't you have a go there?


It'd be good to see your patch though, as it'd be something to line up against all the commit strategies: "classic/broken", "s3a staging", "s3a magic", "stocator".


FWIW, Teragen is a meaningless benchmark except as a stress test of a cluster and as a bootstrap to terasort tests; it doesn't resemble any real workloads. TPC-DS is the one to play with.

See also: http://steveloughran.blogspot.co.uk/2017/09/stocator-high-performance-object-store.html

> Write directly without creating temp directory to avoid rename 
> ---------------------------------------------------------------
>
>                 Key: HADOOP-15087
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15087
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Yonger
>
> Renames in workloads like Teragen/Terasort that use Hadoop's default output committers really hurt performance.
> Stocator announces that it doesn't create temporary directories at all and still preserves Hadoop's fault tolerance. I added a switch when creating files by integrating its code into s3a, and got a 5x performance gain in Teragen and a 15% performance improvement in Terasort.
>  


