You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/02/13 21:33:41 UTC

[jira] [Commented] (FLINK-5788) Document assumptions about File Systems and persistence

    [ https://issues.apache.org/jira/browse/FLINK-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864489#comment-15864489 ] 

ASF GitHub Bot commented on FLINK-5788:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/3301

    [FLINK-5788] [docs] Improve documentation of FileSystem and spell out the data persistence contract

    This writes down the contract that the Flink `FileSystem` and `FSDataOutputStream` implementations have to adhere to in order to support proper consistency and failure recovery. The contract has so far been only implicitly defined and adhered to by the checkpointing and high-availability code.
    
    ## Contract
    
    Data written to an `FSDataOutputStream` created from a `FileSystem` is considered persistent, if two requirements are met:
    
      1. **Visibility Requirement:** It must be guaranteed that all other processes, machines,
         virtual machines, containers, etc. that are able to access the file see the data consistently
         when given the absolute file path. This requirement is similar to the *close-to-open*
         semantics defined by POSIX, but restricted to the file itself (by its absolute path).
    
      2. **Durability Requirement:** The file system's specific durability/persistence requirements
         must be met. These are specific to the particular file system. For example the
         `LocalFileSystem` does not provide any durability guarantees for crashes of both
         hardware and operating system, while replicated distributed file systems (like HDFS)
         guarantee typically durability in the presence of up to concurrent failure or *n*
         nodes, where *n* is the replication factor.
    
    Updates to the file's parent directory (such as that the file shows up when listing the directory contents) are not required to be complete for the data in the file stream to be considered persistent. This relaxation is important for file systems where updates to directory contents are only eventually consistent (like S3).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink filesystem_docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3301
    
----

----


> Document assumptions about File Systems and persistence
> -------------------------------------------------------
>
>                 Key: FLINK-5788
>                 URL: https://issues.apache.org/jira/browse/FLINK-5788
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 1.3.0
>
>
> We should add some description about the assumptions we make for the behavior of {{FileSystem}} implementations to support proper checkpointing and recovery operations.
> This is especially critical for file systems like {{S3}} with a somewhat tricky contract.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)