Posted to issues@flink.apache.org by twalthr <gi...@git.apache.org> on 2018/04/17 13:21:03 UTC

[GitHub] flink pull request #5861: [FLINK-9113] [connectors] Use raw local file syste...

GitHub user twalthr opened a pull request:

    https://github.com/apache/flink/pull/5861

    [FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss

    ## What is the purpose of the change
    
    This change replaces Hadoop's LocalFileSystem (a checksumming filesystem) with the raw local filesystem implementation. To compute checksums, the default filesystem flushes only in 512-byte intervals, so data written after the last complete interval might be lost during checkpointing. To guarantee exact results, we skip the checksum computation and perform a raw flush.
    
    Negative effect: Existing checksums are no longer maintained and thus become invalid.
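
The flushing problem described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Hadoop or Flink code: `ChunkedFlushStream` and `ChunkedFlushDemo` are hypothetical names, and the stream merely imitates a writer that forwards data in complete 512-byte chunks (the interval named in the description). It shows how many bytes are visible in the target after a flush, with and without chunking.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Simplified model (hypothetical, NOT Hadoop's implementation): a stream
 * that only forwards complete 512-byte chunks to its target, so a flush
 * can leave a partial chunk behind in the buffer.
 */
final class ChunkedFlushStream extends OutputStream {
    static final int CHUNK_SIZE = 512; // interval taken from the PR description

    private final OutputStream target;
    private final byte[] chunk = new byte[CHUNK_SIZE];
    private int filled = 0;

    ChunkedFlushStream(OutputStream target) {
        this.target = target;
    }

    @Override
    public void write(int b) throws IOException {
        chunk[filled++] = (byte) b;
        if (filled == CHUNK_SIZE) {
            // Only a complete chunk is handed to the target.
            target.write(chunk, 0, CHUNK_SIZE);
            filled = 0;
        }
    }

    @Override
    public void flush() throws IOException {
        // The partial chunk stays buffered; only the target is flushed.
        target.flush();
    }

    @Override
    public void close() throws IOException {
        // The remaining partial chunk is written only on close.
        target.write(chunk, 0, filled);
        filled = 0;
        target.close();
    }
}

final class ChunkedFlushDemo {
    /** Bytes visible in the target after writing n bytes and flushing via the chunked stream. */
    static int visibleAfterFlushChunked(int n) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ChunkedFlushStream out = new ChunkedFlushStream(sink);
        out.write(new byte[n]);
        out.flush();
        return sink.size();
    }

    /** Bytes visible in the target after writing n bytes and flushing directly (the "raw" case). */
    static int visibleAfterFlushRaw(int n) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(new byte[n]);
        sink.flush();
        return sink.size();
    }
}
```

In this model, a checkpoint that records the visible length right after a flush would miss the buffered tail of the chunked stream, whereas the raw write exposes every byte. How closely this matches the actual Hadoop behavior is an assumption based on the description above.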
    
    ## Brief change log
    
    - Replace local filesystem by raw filesystem
    
    
    ## Verifying this change
    
    Added a check for verifying the file length and file size.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): no
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? no
      - If yes, how is the feature documented? not applicable


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/twalthr/flink FLINK-9113

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5861
    
----
commit 17b85bd5fd65e6ec31374df0ca0af7451881d90a
Author: Timo Walther <tw...@...>
Date:   2018-04-17T13:12:55Z

    [FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss

----


---

[GitHub] flink pull request #5861: [FLINK-9113] [connectors] Use raw local file syste...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5861


---

[GitHub] flink issue #5861: [FLINK-9113] [connectors] Use raw local file system for b...

Posted by twalthr <gi...@git.apache.org>.
Github user twalthr commented on the issue:

    https://github.com/apache/flink/pull/5861
  
    It seems that Hadoop 2.8.3 supports truncation for the raw local filesystem. I will need to adapt the test for that.


---

[GitHub] flink pull request #5861: [FLINK-9113] [connectors] Use raw local file syste...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5861#discussion_r182452074
  
    --- Diff: flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java ---
    @@ -1245,6 +1246,12 @@ else if (scheme != null && authority == null) {
     			}
     
     			fs.initialize(fsUri, finalConf);
    +
    +			// By default we don't perform checksums on Hadoop's local filesystem and use the raw filesystem.
    --- End diff --
    
    The "by default" is not necessary anymore. We now always use the raw filesystem. This is a leftover from the previous version that allowed changing this.


---