Posted to issues@flink.apache.org by NicoK <gi...@git.apache.org> on 2018/02/28 17:35:28 UTC

[GitHub] flink pull request #5601: [FLINK-8336][yarn/s3][tests] harden YarnFileStageT...

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/5601

    [FLINK-8336][yarn/s3][tests] harden YarnFileStageTest upload test for eventually consistent read-after-write

    ## What is the purpose of the change
    
    If the newly written object cannot be read yet (e.g. due to eventually consistent read-after-write on file systems like S3), we retry up to 4 more times, waiting 50ms between attempts. While this does not cover all cases, it should make the (already rare) case of a written object not being readable even more unlikely.
    
    ## Brief change log
    
    - let `YarnFileStageTest` try up to 5 times, waiting 50ms between attempts, before giving up on reading a file that was uploaded previously (see the sketch below)
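    
    A minimal sketch of the retry pattern (names taken from the review diff further down; `verifyContents` is a hypothetical stand-in for the test's assertions, and the loop condition is an assumption since the diff excerpt cuts off before it):
    
        // Sketch: up to 5 attempts, sleeping 50ms between them, to tolerate
        // an eventually consistent read-after-write on the target file system.
        int retries = 5;
        do {
            try (FSDataInputStream in = targetFileSystem.open(targetFile.getPath())) {
                verifyContents(in); // hypothetical helper standing in for the assertions
                break; // read succeeded, no retry needed
            } catch (FileNotFoundException e) {
                // object not visible yet -> wait a bit and try again
                Thread.sleep(50);
            }
        } while (--retries > 0);
        // (the real test would still have to fail once all retries are exhausted)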
    
    ## Verifying this change
    
    This change is a trivial rework / code cleanup without any test coverage.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): **no**
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
      - The serializers: **no**
      - The runtime per-record code paths (performance sensitive): **no**
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
      - The S3 file system connector: **no**
    
    ## Documentation
    
      - Does this pull request introduce a new feature? **no**
      - If yes, how is the feature documented? **not applicable**


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink flink-8336

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5601
    
----
commit 9d7293fa7419344955e35851895a992af9eb09e4
Author: Nico Kruber <ni...@...>
Date:   2018-02-27T16:29:00Z

    [FLINK-8336][yarn/s3][tests] harden YarnFileStageTest upload test for eventually consistent read-after-write
    
    If the newly written object cannot be read yet, we retry up to 4 more times,
    waiting 50ms between attempts. While this does not cover all cases, it should
    make the (already rare) case of a written object not being readable even more
    unlikely.

----


---

[GitHub] flink issue #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageTest upl...

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/5601
  
    also see further test changes in this regard in #5624


---

[GitHub] flink issue #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageTest upl...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/5601
  
    S3 is actually strongly consistent when reading newly created objects, just not when listing or renaming objects (files).
    
    The test seems to read full paths directly, so I am wondering why there is a failure in the first place.
    
    If there is an issue that the Yarn upload code relies on eventually consistent operations, then fixing the test with retries may disguise the actual issue. If there is no eventually consistent operation, then the retries should not be necessary in the first place. I fear this change may be going down a tricky path...
    
    Can you explain/double-check why the failure happened and why the retry is necessary to stabilize the test, while the actual Yarn code is not affected by this?


---

[GitHub] flink pull request #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageT...

Posted by StefanRRichter <gi...@git.apache.org>.
Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5601#discussion_r172152959
  
    --- Diff: flink-yarn/src/test/java/org/apache/flink/yarn/YarnFileStageTest.java ---
    @@ -200,13 +201,23 @@ static void testCopyFromLocalRecursive(
     			while (targetFilesIterator.hasNext()) {
     				LocatedFileStatus targetFile = targetFilesIterator.next();
     
    -				try (FSDataInputStream in = targetFileSystem.open(targetFile.getPath())) {
    -					String absolutePathString = targetFile.getPath().toString();
    -					String relativePath = absolutePathString.substring(workDirPrefixLength);
    -					targetFiles.put(relativePath, in.readUTF());
    -
    -					assertEquals("extraneous data in file " + relativePath, -1, in.read());
    -				}
    +				int retries = 5;
    --- End diff --
    
    I wonder whether this should be a magic number or rather something that can be configured?
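    
    For illustration, one way the inline literals could be lifted out (hypothetical names; a sketch of the suggestion, not what the PR does):
    
        // Hypothetical: named constants (or test parameters) so the retry
        // behavior can be tuned in one place instead of via magic numbers.
        private static final int READ_RETRIES = 5;
        private static final long RETRY_DELAY_MILLIS = 50L;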


---

[GitHub] flink pull request #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageT...

Posted by StefanRRichter <gi...@git.apache.org>.
Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5601#discussion_r172152995
  
    --- Diff: flink-yarn/src/test/java/org/apache/flink/yarn/YarnFileStageTest.java ---
    @@ -200,13 +201,23 @@ static void testCopyFromLocalRecursive(
     			while (targetFilesIterator.hasNext()) {
     				LocatedFileStatus targetFile = targetFilesIterator.next();
     
    -				try (FSDataInputStream in = targetFileSystem.open(targetFile.getPath())) {
    -					String absolutePathString = targetFile.getPath().toString();
    -					String relativePath = absolutePathString.substring(workDirPrefixLength);
    -					targetFiles.put(relativePath, in.readUTF());
    -
    -					assertEquals("extraneous data in file " + relativePath, -1, in.read());
    -				}
    +				int retries = 5;
    +				do {
    +					try (FSDataInputStream in = targetFileSystem.open(targetFile.getPath())) {
    +						String absolutePathString = targetFile.getPath().toString();
    +						String relativePath = absolutePathString.substring(workDirPrefixLength);
    +						targetFiles.put(relativePath, in.readUTF());
    +
    +						assertEquals("extraneous data in file " + relativePath, -1, in.read());
    +						break;
    +					} catch (FileNotFoundException e) {
    +						// For S3, read-after-write may be eventually consistent, i.e. when trying
    +						// to access the object before writing it; see
    +						// https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
    +						// -> try again a bit later
    +						Thread.sleep(50);
    --- End diff --
    
    Same here.


---

[GitHub] flink issue #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageTest upl...

Posted by StefanRRichter <gi...@git.apache.org>.
Github user StefanRRichter commented on the issue:

    https://github.com/apache/flink/pull/5601
  
    LGTM 👍 


---

[GitHub] flink issue #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageTest upl...

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/5601
  
    The main issue was actually described in FLINK-8801:
    
    According to https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
    
    > Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
    
    
    Some S3 file system implementations may actually issue such a request for the object that is about to be written, and thus their read-after-write is only eventually consistent.
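    
    For illustration, the caveat in Hadoop `FileSystem` terms (a hypothetical sketch, assuming the usual `org.apache.hadoop.fs` imports; whether a HEAD request is actually issued internally depends on the S3 client implementation):
    
        // Hypothetical illustration: an exists() check (a HEAD request) issued
        // before the object is created downgrades the later read-after-write
        // from strong to eventual consistency on S3.
        FileSystem fs = FileSystem.get(new URI("s3://bucket/"), new Configuration());
        Path path = new Path("/key");
        fs.exists(path); // HEAD on a not-yet-existing key
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("payload"); // PUT of the new object
        }
        // fs.open(path) may now throw FileNotFoundException for a short while.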


---

[GitHub] flink issue #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageTest upl...

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/5601
  
    Different FS implementations try to work around the problem (or not): Presto's implementation, for example, retries internally whenever it cannot find an object, while Hadoop's implementation did not (this may depend on the version and may be subject to change).
    Also, I don't know what the guarantees say about a different client reading your write - that would be the actual real-world use case: artifacts uploaded by the client, then downloaded via Yarn on the worker nodes.


---

[GitHub] flink pull request #5601: [FLINK-8818][yarn/s3][tests] harden YarnFileStageT...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5601


---