Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/12 17:41:00 UTC

[jira] [Commented] (FLINK-5706) Implement Flink's own S3 filesystem

    [ https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202332#comment-16202332 ] 

ASF GitHub Bot commented on FLINK-5706:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/4818

    [FLINK-5706] [file systems] Add S3 file systems without Hadoop dependencies

    ## What is the purpose of the change
    
    This adds two file system implementations that write to S3, so that users can use Flink with S3 without depending on Hadoop, as an alternative to Hadoop's S3 connectors.
    
    Neither is an actual re-implementation; both wrap existing implementations and shade their dependencies.
    
    1. The first is a wrapper around Hadoop's s3a file system. By pulling in a smaller dependency tree and shading all dependencies away, this keeps Flink Hadoop-free from a dependency perspective. We can also bump the shaded Hadoop dependency here to pick up improvements to s3a (as in Hadoop 3.0) without causing dependency conflicts.
    
    2. The second S3 file system is from the Presto project. Initial simple tests indicate that it responds slightly faster and in a more lightweight manner to write/read/list requests than the Hadoop s3a FS, but it has some semantic differences. For example, creating a directory does not mean the file system recognizes that the directory exists; the directory is only recognized as existing once files are inserted into it (see the sketch after this list). For checkpointing, that could even be preferable.
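    To make the directory-visibility difference concrete, here is a minimal sketch against Flink's `FileSystem` API. The bucket name is a placeholder, and the commented expectations are assumptions about the two backends rather than guarantees:
    
    ```java
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    
    public class S3DirectorySemanticsProbe {
    
        public static void main(String[] args) throws Exception {
            // "my-bucket" is a placeholder; the FileSystem is resolved via the "s3" scheme.
            final Path dir = new Path("s3://my-bucket/some/dir");
            final FileSystem fs = dir.getFileSystem();
    
            fs.mkdirs(dir);
    
            // Assumption: true for the Hadoop s3a wrapper; possibly false for the
            // Presto-based FS, where the directory may only appear once a file
            // exists under it.
            System.out.println("visible after mkdirs: " + fs.exists(dir));
    
            // After a file is inserted, both implementations should see the directory.
            try (FSDataOutputStream out =
                    fs.create(new Path(dir, "marker"), FileSystem.WriteMode.NO_OVERWRITE)) {
                out.write(0);
            }
            System.out.println("visible after insert: " + fs.exists(dir));
        }
    }
    ```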
    
    Both file systems register themselves under the `s3` scheme, to avoid overloading the `s3n` and `s3a` schemes used by Hadoop.
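    
    For context, a file system announces its scheme through Flink's `FileSystemFactory` interface, discovered via `java.util.ServiceLoader`. The sketch below is illustrative only; the class name and method bodies are hypothetical and not the code in this PR:
    
    ```java
    import java.io.IOException;
    import java.net.URI;
    
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.FileSystemFactory;
    
    /** Illustrative factory binding an S3 FileSystem implementation to the "s3" scheme. */
    public class ExampleS3FileSystemFactory implements FileSystemFactory {
    
        @Override
        public String getScheme() {
            // Register under "s3" rather than Hadoop's "s3n" / "s3a" schemes.
            return "s3";
        }
    
        @Override
        public void configure(Configuration config) {
            // Forward Flink configuration (credentials, endpoint, ...) to the wrapped FS.
        }
    
        @Override
        public FileSystem create(URI fsUri) throws IOException {
            // Wrap and return the shaded Hadoop s3a or Presto S3 implementation here.
            throw new UnsupportedOperationException("sketch only");
        }
    }
    ```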
    
    ## Brief change log
    
      - Adds `flink-filesystems/flink-s3-fs-hadoop`
      - Adds `flink-filesystems/flink-s3-fs-presto`
    
    ## Verifying this change
    
    This adds some initial integration tests, which depend on S3 credentials. These credentials are not in the code but are stored encrypted on Travis, so the tests can only run in a meaningful way either on the `apache/flink` master branch, or in a committer repository where the committer has enabled Travis uploads to S3 (for logs); the tests here use the same secret credentials.
    
    Since this does not implement the actual S3 communication itself, there are no tests for that layer. The tests only cover instantiation and whether S3 communication can be established (simple reads/writes to a bucket, listing, etc.), along the lines of the sketch below.
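    
    As an illustration of the kind of round trip those tests exercise, here is a hedged sketch against Flink's `FileSystem` API; bucket and key are placeholders, and credentials are assumed to come from the environment rather than from code:
    
    ```java
    import java.nio.charset.StandardCharsets;
    
    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    
    public class S3RoundTripProbe {
    
        public static void main(String[] args) throws Exception {
            final Path path = new Path("s3://my-bucket/tests/hello.txt");
            final FileSystem fs = path.getFileSystem();
    
            // Write a small object.
            try (FSDataOutputStream out = fs.create(path, FileSystem.WriteMode.OVERWRITE)) {
                out.write("hello".getBytes(StandardCharsets.UTF_8));
            }
    
            // Read it back.
            final byte[] buf = new byte[5];
            try (FSDataInputStream in = fs.open(path)) {
                final int read = in.read(buf);
                System.out.println("read " + read + " bytes: "
                        + new String(buf, StandardCharsets.UTF_8));
            }
    
            // Stat and clean up.
            System.out.println("status: " + fs.getFileStatus(path));
            fs.delete(path, false);
        }
    }
    ```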
    
    The change can also be verified by building Flink, copying the respective S3 FS JAR from `/opt` into `/lib`, and running a workload that checkpoints or writes to S3, along the lines of the sketch below.
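    
    As one example of such a workload, a minimal job like the following checkpoints through the new `s3://` scheme. The bucket and path are placeholders, and a real verification run should live long enough for several checkpoints to complete:
    
    ```java
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class S3CheckpointSmokeTest {
    
        public static void main(String[] args) throws Exception {
            final StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
    
            // Checkpoint every 10 seconds into a placeholder bucket via the "s3" scheme.
            env.enableCheckpointing(10_000L);
            env.setStateBackend(new FsStateBackend("s3://my-bucket/flink-checkpoints"));
    
            env.fromElements(1L, 2L, 3L)
                    .map(new MapFunction<Long, Long>() {
                        @Override
                        public Long map(Long value) {
                            return value * 2;
                        }
                    })
                    .print();
    
            env.execute("s3-checkpoint-smoke-test");
        }
    }
    ```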
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (**yes** / no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
    
    Proper behavior of the file systems is important; otherwise, checkpointing may fail. In some sense we already rely on the Hadoop project's tests of its HDFS and S3 connectors, and this change adds a similar dependency.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (**yes** / no)
      - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**)
    
    Will add documentation once the details of this feature are agreed upon.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fs_s3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4818
    
----
commit a0b89ad02d4cb67c4d5e1e28efcb6872af0540e6
Author: Stephan Ewen <se...@apache.org>
Date:   2017-10-06T15:41:00Z

    [FLINK-5706] [file systems] Add S3 file systems without Hadoop dependencies
    
    This adds two implementations of a file system that write to S3.
    Both are not actual re-implementations but wrap other implementations and shade dependencies.
    
    (1) A wrapper around Hadoop's s3a file system. By pulling a smaller dependency tree and
        shading all dependencies away, this keeps the appearance of Flink being Hadoop-free,
        from a dependency perspective.
    
    (2) The second S3 file system is from the Presto Project.
        Initial simple tests seem to indicate that it responds slightly faster
        and in a bit more lightweight manner to write/read/list requests, compared
        to the Hadoop s3a FS, but it has some semantic differences.

----


> Implement Flink's own S3 filesystem
> -----------------------------------
>
>                 Key: FLINK-5706
>                 URL: https://issues.apache.org/jira/browse/FLINK-5706
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>
> As part of the effort to make Flink completely independent from Hadoop, Flink needs its own S3 filesystem implementation. Currently, Flink relies on Hadoop's s3a and s3n file systems.
> Flink's own S3 file system can be implemented using the AWS SDK. The Hadoop file system can be used as the basis of the implementation (Apache licensed, so it should be okay to reuse some code as long as we provide proper attribution).


