Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/03/12 18:37:00 UTC

[jira] [Commented] (HADOOP-13007) cherry pick s3 enhancements from PrestoS3FileSystem

    [ https://issues.apache.org/jira/browse/HADOOP-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058166#comment-17058166 ] 

Steve Loughran commented on HADOOP-13007:
-----------------------------------------

h2. Presto features

I just had a long look through the code to see what its directory logic was.

* docs: https://prestosql.io/docs/current/connector/hive.html#amazon-s3-configuration
* source https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java

h3. Interesting points

* glacier objects can be skipped
* listing always adds the trailing / for a non-empty key and then does the scan
* it doesn't create directory markers, so there's no need to recreate them


(I was about to say "demand-create the AWS client", but I misread it. Interesting thought, though.)

h2. listLocatedStatus

does a LIST under path p + /:

{code}
String key = keyFromPath(path);
if (!key.isEmpty()) {
    key += PATH_SEPARATOR;
}

ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName(getBucketName(uri))
        .withPrefix(key)
        .withDelimiter(PATH_SEPARATOR);

{code}
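For context, a minimal sketch (AWS SDK v1 assumed; println stands in for real handling) of how that delimiter listing separates direct children from pseudo-directories:

{code}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

static void listChildren(AmazonS3 s3, ListObjectsRequest request) {
    ObjectListing listing = s3.listObjects(request);
    // objects directly under the prefix come back as summaries...
    for (S3ObjectSummary file : listing.getObjectSummaries()) {
        System.out.println("file: " + file.getKey());
    }
    // ...while keys under a deeper "/" are rolled up into common
    // prefixes, i.e. pseudo-directories
    for (String dir : listing.getCommonPrefixes()) {
        System.out.println("dir:  " + dir);
    }
}
{code}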


mapping to statuses can skip objects whose getStorageClass() is "GLACIER"; those objects aren't visible at all.

{code}
    private Iterator<LocatedFileStatus> statusFromObjects(List<S3ObjectSummary> objects)
    {
        // NOTE: for encrypted objects, S3ObjectSummary.size() used below is NOT correct,
        // however, to get the correct size we'd need to make an additional request to get
        // user metadata, and in this case it doesn't matter.
        return objects.stream()
                .filter(object -> !object.getKey().endsWith(PATH_SEPARATOR))
                .filter(object -> !skipGlacierObjects || !isGlacierObject(object))
                .map(object -> new FileStatus(
                        object.getSize(),
                        false,
                        1,
                        BLOCK_SIZE.toBytes(),
                        object.getLastModified().getTime(),
                        qualifiedPath(new Path(PATH_SEPARATOR + object.getKey()))))
                .map(this::createLocatedFileStatus)
                .iterator();
    }
{code}
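The glacier filter above hinges on the storage class of each summary; a minimal sketch of what that check could look like (the real method is private to PrestoS3FileSystem):

{code}
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.s3.model.StorageClass;

// objects archived to Glacier can't be read without a restore,
// so filtering them out of listings hides them from all queries
static boolean isGlacierObject(S3ObjectSummary object) {
    return StorageClass.Glacier.toString().equals(object.getStorageClass());
}
{code}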

Note: this filtering feeds into delete(), intentional or not.

{code}
public boolean mkdirs(Path f, FsPermission permission)
{
    // no need to do anything for S3
    return true;
}         
{code}

* but getFileStatus does do the full probe sequence: HEAD key, HEAD key + /, LIST key + /; a sketch of that sequence follows the code below

{code}
return new FileStatus(
        getObjectSize(path, metadata),
        S3_DIRECTORY_OBJECT_CONTENT_TYPE.equals(metadata.getContentType()),
        1,
        BLOCK_SIZE.toBytes(),
        lastModifiedTime(metadata),
        qualifiedPath(path));
        
{code}
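A hedged sketch of that probe sequence (method name and return values mine, not Presto's); note it is up to three round trips per getFileStatus call:

{code}
import java.io.FileNotFoundException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AmazonS3Exception;

static String probe(AmazonS3 s3, String bucket, String key)
        throws FileNotFoundException {
    try {
        s3.getObjectMetadata(bucket, key);                   // HEAD key
        return "file";
    } catch (AmazonS3Exception e) {
        if (e.getStatusCode() != 404) { throw e; }
    }
    try {
        s3.getObjectMetadata(bucket, key + "/");             // HEAD key + /
        return "dir (marker object)";
    } catch (AmazonS3Exception e) {
        if (e.getStatusCode() != 404) { throw e; }
    }
    if (!s3.listObjects(bucket, key + "/")                   // LIST key + /
            .getObjectSummaries().isEmpty()) {
        return "dir (implicit: children, no marker)";
    }
    throw new FileNotFoundException(bucket + "/" + key);
}
{code}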


* file length is inferred from metadata: the "x-amz-unencrypted-content-length" header is parsed. The docs acknowledge that the sizes returned in a list are not consistent with this.
* the directory marker content type is used for the "is this a dir" check

h3. input stream

optimised for forward reads:

* doesn't do a HEAD at open()
* lazy seek: the new position is only acted on in the next read
* forward seeks within a range (default 1MB) are skipped through; anything else is a (non-draining) abort() and reopen (see the sketch below)
* opens the GET with the initial position as the range start, but no end limit
* maps HTTP 416 ("range not satisfiable") to EOFException
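A minimal sketch of that lazy seek plus skip-or-reopen logic (field names and the helper are hypothetical, not the Presto ones):

{code}
import java.io.EOFException;
import java.io.IOException;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

class LazySeekSketch {
    private static final long MAX_SKIP_SIZE = 1024 * 1024; // default 1MB skip range

    private S3ObjectInputStream in;   // current ranged GET, may be null
    private long streamPosition;      // where the wire stream actually is
    private long nextReadPosition;    // where the caller wants to be

    void seek(long pos) {
        nextReadPosition = pos;       // lazy: nothing happens until the next read
    }

    void seekStreamIfNecessary() throws IOException {
        long gap = nextReadPosition - streamPosition;
        if (in != null && gap >= 0 && gap <= MAX_SKIP_SIZE) {
            // small forward seek: cheaper to read through the gap
            while (gap > 0) {
                long skipped = in.skip(gap);
                if (skipped <= 0) {
                    throw new EOFException("seek past end of stream");
                }
                gap -= skipped;
            }
        } else {
            if (in != null) {
                in.abort();           // non-draining: kills the HTTP connection
            }
            in = openStreamAt(nextReadPosition);
        }
        streamPosition = nextReadPosition;
    }

    private S3ObjectInputStream openStreamAt(long position) {
        // hypothetical helper: GET with "Range: bytes=<position>-", no end limit
        throw new UnsupportedOperationException("sketch only");
    }
}
{code}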




h3. delete()

* doesn't do bulk deletes: it just lists children and recursively calls delete() on them (very, very inefficient; see the sketch below)
* for recursive deletes, if glacier files are skipped in listings, they don't get deleted. Interesting idea.
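The shape of it, roughly (a sketch of the pattern, not the actual Presto code):

{code}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// one LIST page at a time, then one HTTP DELETE per key:
// O(files) round trips instead of one bulk DeleteObjects per 1000 keys
static void deleteRecursively(AmazonS3 s3, String bucket, String prefix) {
    ObjectListing listing = s3.listObjects(bucket, prefix);
    while (true) {
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            // any glacier objects filtered out of the listing are
            // silently skipped here too, so they survive the delete
            s3.deleteObject(bucket, summary.getKey());
        }
        if (!listing.isTruncated()) {
            break;
        }
        listing = s3.listNextBatchOfObjects(listing);
    }
}
{code}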


h3. Metrics

implements AWS SDK stats collection in com.facebook.presto.hive.s3.PrestoS3FileSystemMetricCollector; these feed back into the Presto metrics.
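The wiring is roughly this shape, via the SDK v1 RequestMetricCollector hook (class and counter here are mine; the Presto collector tracks more):

{code}
import java.util.concurrent.atomic.AtomicLong;
import com.amazonaws.Request;
import com.amazonaws.Response;
import com.amazonaws.metrics.RequestMetricCollector;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

class SketchMetricCollector extends RequestMetricCollector {
    final AtomicLong requests = new AtomicLong();

    @Override
    public void collectMetrics(Request<?> request, Response<?> response) {
        // invoked by the SDK after every request; a real collector pulls
        // retries, throttles and timings off request.getAWSRequestMetrics()
        // and feeds them into filesystem statistics
        requests.incrementAndGet();
    }
}

class MetricsWiring {
    static AmazonS3 newClient() {
        // the builder attaches the collector to every request the client makes
        return AmazonS3ClientBuilder.standard()
                .withMetricsCollector(new SketchMetricCollector())
                .build();
    }
}
{code}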

h3. create()

if overwrite=true, it skips all checks (even for the dest being a dir).
{code}
if ((!overwrite) && exists(path)) {
    throw new IOException("File already exists:" + path);
}
{code}
Should we do that? It's, well, aggressive, but for apps which know what they are doing...

# A new transfer manager is created for each output stream
# The entire file is written to the staging dir
# The transfer manager is given the file to upload; it will partition and upload it how it chooses

So: no incremental writes, and the disk is always used. But good simplicity, and no risk of partial uploads hanging around.
Omitting all existence checks does make for a faster write and avoids any 404s sneaking in; the whole write path is sketched below.
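That write path, roughly, with the SDK v1 TransferManager (names mine):

{code}
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

// called from close(): the staged file is complete, so hand it to a
// TransferManager, which picks single PUT vs. multipart upload itself
static void uploadStagedFile(AmazonS3 s3, String bucket, String key, File staged)
        throws InterruptedException {
    TransferManager tm = TransferManagerBuilder.standard()
            .withS3Client(s3)
            .build();
    try {
        Upload upload = tm.upload(bucket, key, staged);
        upload.waitForCompletion();  // blocks; nothing visible until it succeeds
    } finally {
        tm.shutdownNow(false);       // false: keep the shared S3 client alive
        staged.delete();             // staging file no longer needed
    }
}
{code}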

h2. Overall analysis

It's nicely minimal: optimised for directory trees with no markers.
* the input stream seems a lot less optimised than our code, which works better for backwards-seeking clients
* the output stream avoids 404s by omitting all probes. Something to consider
* the metric wireup from the AWS SDK looks simple enough for us to copy
* it does handle CSE (client-side encryption), where the actual length < the listed length, but the caller has to do a HEAD to see this; anything which runs off the result of listLocatedStatus and then seeks to EOF-16 would fail. Similarly: splitting.


What can we adopt?

# metrics
# maybe: skipping the dest-is-a-dir check when overwrite=true
# dir content-type get/check, i.e. something like:


{code}
isDir = (key.endsWith("/") && length == 0)
    || "application/x-directory".equals(contentType);
{code}



> cherry pick s3 enhancements from PrestoS3FileSystem
> --------------------------------------------------
>
>                 Key: HADOOP-13007
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13007
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> Looking at [https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java], they've done some interesting things: configurable connection timeouts and retry options, statistics to count exceptions caught and streams re-opened, and more.
> Review them; if there is good stuff there, add it to S3A.


