
Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2013/09/18 11:02:52 UTC

[jira] [Commented] (HADOOP-9978) Support range reads in s3n interface to split objects for mappers to read

    [ https://issues.apache.org/jira/browse/HADOOP-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770584#comment-13770584 ] 

Steve Loughran commented on HADOOP-9978:
----------------------------------------

There is already a seek(), so more than one mapper can read different parts of the same S3 file by offset once the initial GET has read in the file header. That file header is still needed to
# determine the length of the blob
# meet the standard expectation "open() fails if the file isn't there"

Were it not for #2, we could delay the open until the first read and so save one round trip (which matters more long-haul than in-EC2), but people don't expect that behaviour.
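To make the offset-based reads concrete, here is a minimal sketch of the byte ranges a client could request for successive blocks. It assumes a seek is served by issuing a GET with a standard HTTP Range header; the class and method names are hypothetical and are not taken from the s3n code.

```java
// Hypothetical helper, not part of Hadoop: builds HTTP Range header values.
final class RangeHeaderSketch {

    /** Range value for exactly `length` bytes starting at `offset`. */
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        long lastByte = offset + length - 1; // HTTP byte ranges have an inclusive end
        return "bytes=" + offset + "-" + lastByte;
    }

    /** Open-ended range: everything from `offset` to the end of the object. */
    static String rangeFrom(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be >= 0");
        }
        return "bytes=" + offset + "-";
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, used here only as an example size
        System.out.println(rangeHeader(0, blockSize));          // first block
        System.out.println(rangeHeader(blockSize, blockSize));  // second block
        System.out.println(rangeFrom(2 * blockSize));           // from the third block onwards
    }
}
```

The open-ended form matches a seek followed by sequential reads; the bounded form would suit a reader that knows exactly which block it owns.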

What S3n does do is pretend that the data has a block size, so that the splitter can divide a file into blocks and hand each block to a different mapper. You can configure this with {{"fs.s3n.block.size"}}; it defaults to 64 MB, but you are free to make it smaller or larger.
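As a rough illustration of what that pretend block size implies for the splitter, the sketch below divides a file length into (offset, length) pairs, one per block. It is a simplification under stated assumptions: the class name is hypothetical, and the real Hadoop input formats apply further rules (such as minimum and maximum split sizes) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop's splitter: one split per nominal block.
final class BlockSplitSketch {

    /** Returns {offset, length} pairs covering [0, fileLength) in blockSize steps. */
    static List<long[]> computeSplits(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The final split may be shorter than a full block.
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 4L * 1024 * 1024 * 1024; // a 4 GB file
        long blockSize = 64L * 1024 * 1024;        // the 64 MB default mentioned above
        List<long[]> splits = computeSplits(fileLength, blockSize);
        System.out.println("splits: " + splits.size());
    }
}
```

With these example figures a 4 GB file yields 64 splits; halving the block size would double the number of splits and hence the number of mappers.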

Even if you run 60 mappers against a 4GB file, the bandwidth you get off an S3 blob won't be 60x that of a single mapper. S3 doesn't do replication the way HDFS does, where the bandwidth is O(blocks*3); for S3 it is O(1). In practice that means you won't get any speedup in the map phase, though a different number of mappers may make things better or worse at reduce time.


                
> Support range reads in s3n interface to split objects for mappers to read
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-9978
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9978
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Amandeep Khurana
>

