Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2017/11/16 20:28:00 UTC

[jira] [Comment Edited] (HADOOP-14943) Add common getFileBlockLocations() emulation for object stores, including S3A

    [ https://issues.apache.org/jira/browse/HADOOP-14943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255544#comment-16255544 ] 

Steve Loughran edited comment on HADOOP-14943 at 11/16/17 8:27 PM:
-------------------------------------------------------------------

There are two methods: {{FileSystem.getFileBlockLocations(Path p, final long start, final long len)}} and {{FileSystem.getFileBlockLocations(FileStatus fs, final long start, final long len)}}. The first one does the getFileStatus and hands off to the second one, which does the real work. The second should be the only one which needs overriding.

We can handle these by building a list of block locations:

1. Divide the file length by the block size, rounding up, to get the number of blocks.
2. Create a list that long, one block location per block.
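
A minimal sketch of those two steps, to make the arithmetic concrete. It is deliberately standalone: {{Block}} and {{emulateBlocks}} are hypothetical names standing in for a real block location type and the overriding method, and the choice to return only the blocks overlapping the requested range is an assumption, not something taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockLocationSketch {

    /** Hypothetical stand-in for a block location: a byte range within the file. */
    static final class Block {
        final long offset;
        final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Block[offset=" + offset + ", length=" + length + "]";
        }
    }

    /**
     * Emulate block locations for a file of fileLength bytes, split into
     * blockSize-byte blocks, keeping only blocks overlapping [start, start + len).
     */
    static List<Block> emulateBlocks(long fileLength, long blockSize,
                                     long start, long len) {
        if (start < 0 || len < 0) {
            throw new IllegalArgumentException("Invalid start or len parameter");
        }
        if (blockSize <= 0) {
            throw new IllegalArgumentException("Block size must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        if (fileLength <= start || len == 0) {
            return blocks; // requested range is empty or starts past the end of the file
        }
        // End of the requested range, clamped to the file length (avoids overflow).
        long end = (len > fileLength - start) ? fileLength : start + len;
        long first = start / blockSize;      // index of the first overlapping block
        long last = (end - 1) / blockSize;   // index of the last overlapping block
        for (long i = first; i <= last; i++) {
            long offset = i * blockSize;
            // The final block of the file may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(offset, length));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte block size, whole file requested.
        for (Block b : emulateBlocks(250, 100, 0, 250)) {
            System.out.println(b);
        }
    }
}
```

In a real filesystem the file length and block size would come from the FileStatus passed to the second overload, and each entry would also need host information, which an object store can only fake.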


was (Author: stevel@apache.org):
Two methods {{FileSystem.getFileBlockLocations(Path p, final long start, final long len)}  and {{FileSystem.getFileBlockLocations(FileStatus fs, final long start, final long len)}; first one does the getFileStatus; second one does the real work. It should be the only one which needs overriding

We can handle these by building a list of block locations

1. Divide up file length by block size
2. create a list that long

> Add common getFileBlockLocations() emulation for object stores, including S3A
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-14943
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14943
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Critical
>         Attachments: HADOOP-14943-001.patch, HADOOP-14943-002.patch
>
>
> It looks suspiciously like S3A isn't providing the partitioning data in {{listLocatedStatus}} and {{getFileBlockLocations()}} needed to break up a file by the blocksize. This will stop tools using the MRv1 APIs from doing the partitioning properly if the input format isn't doing its own split logic.
> FileInputFormat in MRv2 is a bit more configurable about input split calculation and will split up large files, but otherwise the partitioning is being done more by the default values of the executing engine than by any config data from the filesystem about what its "block size" is.
> NativeAzureFS does a better job; maybe that could be factored out to hadoop-common and reused?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
