You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by NicoK <gi...@git.apache.org> on 2017/07/21 13:45:28 UTC

[GitHub] flink pull request #4381: [FLINK-7196][blob] add a TTL to all transient BLOB...

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/4381

    [FLINK-7196][blob] add a TTL to all transient BLOBs

    ## What is the purpose of the change
    
    Transient BLOB files are currently only deleted manually and may thus linger around if not cleaned up. We should introduce a default cleanup time, i.e. a time-to-live (TTL), as a backup cleanup path.
    
    ## Brief change log
    
      - Any `TransientBlobService`, i.e. `BlobServer` and `TransientBlobCache`, gets an additional member keeping track of all expiry times of transient BLOBs.
      - Whenever a transient BLOB enters the local store of a `TransientBlobService`, its TTL will be set to `currentTime + cleanupInterval`.
      - A cleanup task (`TransientBlobCleanupTask`) is executed every `cleanupInterval` seconds and deletes all expired transient BLOBs.
      - As a result, transient BLOBs stay at most `2 * cleanupInterval` seconds before getting deleted automatically.
    
    Please note that this is based on #4359 and should conclude FLIP-19.
    
    ## Verifying this change
    
    This change adds tests and can be verified as follows:
    
      - a new test verifies (for each `BlobServer` and `TransientBlobCache` with and without a job ID) that after adding two transient BLOBs with further accesses only to one of them, that the accessed one remains and the other is deleted. For the existence check, however, there is a slight race condition: if the test is slow and delayed by 1s, the cleanup may kick in and the test may fail. This is hard to test otherwise without further intrusion into the blob store classes.
      - FYI: since transient BLOBs are not used yet, this cannot be tested with a complete flink distribution yet.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): **(no)**
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **(no)**
      - The serializers: **(no)**
      - The runtime per-record code paths (performance sensitive): **(no)**
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **(yes)**:
    
    ## Documentation
    
      - Does this pull request introduce a new feature? **(yes)**
      - If yes, how is the feature documented? **(JavaDocs)**

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink flink-6916-7196

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4381.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4381
    
----
commit 1f86efd013a81e84ba1556de0a04e4ac70229f79
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-22T15:31:17Z

    [FLINK-7053][blob] remove code duplication in BlobClientSslTest
    
    This lets BlobClientSslTest extend BlobClientTest as most of its implementation
    came from there and was simply copied.

commit ff083e850edb8f5f383f54f19519116e10308d61
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-23T09:40:34Z

    [FLINK-7053][blob] verify some of the buffers returned by GET

commit e138569019f5c75b70a21085e4829cb6cd1e93bc
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-23T10:04:10Z

    [FLINK-7053][blob] use TemporaryFolder for local BLOB dir in unit tests
    
    This replaces the use of some temporary directory where it is not guaranteed
    that it will be deleted after the test.

commit 70f06d9b38c5d8afb50b6c18b7b3a4e07e2bb3da
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-21T12:45:31Z

    [FLINK-7054][blob] remove LibraryCacheManager#getFile()
    
    This was only used in tests where it is avoidable but if used anywhere else, it
    may have caused cleanup issues.

commit 5463f65aeb3ed519b51ccc78abfcc003bf02b3f8
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-16T21:13:56Z

    [FLINK-7054][hotfix] fix a checkstyle error

commit 12052c0b8c8b869f9b6bd6f9d53c9ed6e362c631
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-21T14:14:15Z

    [FLINK-7055][blob] refactor getURL() to the more generic getFile()
    
    The fact that we always returned URL objects is a relic of the BlobServer's only
    use for URLClassLoader. Since we'd like to extend its use, returning File
    objects instead is more generic.

commit 9dd7ba652ff1b09fb5bf44e8b6b5b885a56873ab
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-21T16:04:43Z

    [FLINK-7056][blob] add API to allow job-related BLOBs to be stored

commit 90f779ba66728979977bd794e8f68b207471dbc2
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-23T17:17:07Z

    [FLINK-7056][blob] refactor the new API for job-related BLOBs
    
    For a cleaner API, instead of having a nullable jobId parameter, use two methods:
    one for job-related BLOBs, another for job-unrelated ones.

commit 6d1fda80c6f9869676030fd86177bbc7e981b1f9
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-16T21:22:12Z

    [FLINK-7056][hotfix] fix a checkstyle error

commit cdbb0fb32052c8163bc83df151a0c8354846e4dc
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-27T16:29:44Z

    [FLINK-7057][blob] move ref-counting from the LibraryCacheManager to the BlobCache
    
    Also change from BlobKey-based ref-counting to job-based ref-counting which is
    simpler and the mode we want to use from now on. Deferred cleanup (as before)
    is currently not implemented yet (TODO).
    At the BlobServer, no ref-counting will be used but the cleanup will happen
    when the job enters a final state (TODO).

commit 8617c9e8ffbdece8d02815ef71dc09f2d764453a
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-28T09:31:39Z

    [FLINK-7057][blob] change to a cleaner API for BlobService#registerJob()

commit 20cf9bf1fd2b382b6d45dc7b15c7be9bed7f5f16
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-28T12:09:11Z

    [FLINK-7057][blob] implement deferred cleanup at the BlobCache
    
    Whenever a job is not referenced at the BlobCache anymore, we set a TTL and let
    the cleanup task remove it when this is hit and the task is run. For now, this
    means that a BLOB will be retained at most
    (2 * ConfigConstants.LIBRARY_CACHE_MANAGER_CLEANUP_INTERVAL) seconds after not
    being referenced anymore. We do this so that a recovery still has the chance to
    use existing files rather than to download them again.

commit 270f3957b6d5e9061d528158a86eb1837ad2786c
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-28T15:17:06Z

    [FLINK-7057][blob] integrate cleanup of job-related JARs from the BlobServer
    
    TODO: an integration test that verifies that this is actually done when desired
    and not performed when not, e.g. if the job did not reach a final execution
    state

commit 39300a32c90c82f8044cfd6d5fcaf1b83b250c87
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-30T12:52:19Z

    [FLINK-7057][tests] extract FailingBlockingInvokable from CoordinatorShutdownTest

commit 56113349799669326074d269d6c07bb19bd45788
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-06-30T12:56:14Z

    [FLINK-7057][blob] add an integration test for the BlobServer cleanup
    
    This ensures that BLOB files are actually deleted when a job enters a final
    state.

commit 93861f1f624c5b514fcdd041b592e0c40c66d103
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-03T09:25:29Z

    [FLINK-7057][tests] refrain from catching an exception just to fail the test
    
    removes code like this in the BLOB store unit tests:
    
    catch (Exception e) {
        e.printStackTrace();
        fail(e.getMessage());
    }

commit 80a05142a68f8a3d0aaf1d447ab87a1e506364a4
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-03T11:45:33Z

    [FLINK-7057][blob] fix BlobServer#cleanupJob() being too eager
    
    Instead of deleting the job's directory, it was deleting the parent storage
    directory.

commit de4ca2a7c676e7e1a82172ce680ae5a5f6693df4
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-03T15:03:50Z

    [FLINK-7057][blob] fix BlobServer cleanup integration
    
    * the test did not check the correct directories for cleanup
    * the test did not honour the test timeout

commit dc58fa5d5ffc299b493dc1a78f26aa1914036151
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-03T15:11:18Z

    [FLINK-7057][blob] test and fix BlobServer cleanup for a failed job submission

commit d1de87bfd34c5bb1f2dde0402bcceac6013c150b
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-04T07:55:29Z

    [FLINK-7057][blob] rework the LibraryCacheManager API
    
    Since ref-counting has moved to the BlobCache, the BlobLibraryCacheManager is
    just a thin wrapper to get a user class loader by retrieving BLOBs from the
    BlobCache/BlobServer. Therefore, move the job-registration/-release out of it,
    too, and restrict its use to the task manager where the BlobCache is used (on
    the BlobServer, jobs do not need registration since they are only used once and
    will be deleted when they enter a final state).
    
    This makes the BlobServer and BlobCache instances available at the JobManager
    and TaskManager instances, respectively, also enabling future use cases outside
    of the LibraryCacheManager.

commit b62f1f640e33126523eaaa7bd51f0b24962c67fd
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-04T09:50:07Z

    [hotfix] increase Scala checkstyle maxParameters to 20

commit 5e9a9a7cd779a8e0d28947bcfa64cc16c02857f2
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-04T09:51:28Z

    [FLINK-7057][blob] address PR comments

commit 5fa960676281fc1473105c849fbce9cb22cacacb
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-04T12:41:18Z

    [FLINK-7057][blob] fix JobManagerLeaderElectionTest

commit a6ca6812af7b4130e77775fd3896f3f44a357334
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-04T20:49:12Z

    [FLINK-7057][blob] re-introduce some ref-counting for BlobLibraryCacheManager
    
    Apparently, we do need to return the same ClassLoader for different (parallel)
    tasks of a job running on the same task manager. Therefore, keep the initial
    task registration implementation that was removed with
    8331fbb208d975e0c1ec990344c14315ea08dd4a and only adapt it here. This also
    restores some tests and adds new combinations not tested before.

commit da1be581529813ec359f1ac09a00bc564a3812dd
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-06T07:38:17Z

    [FLINK-7057][blob] address PR comments

commit b02807e75cea960fd2c89e49656be2e6f6e32394
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-10T11:40:00Z

    [FLINK-7057][tests] fix (manual/ignored) BlobCacheCleanupTest#testJobDeferredCleanup()

commit 0f9f7d7f90497708cd2589b34573f10be0b9d331
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-16T21:30:43Z

    [FLINK-7057][hotfix] fix a checkstyle error

commit ba540ae1445a8144aff389c5056f34b54a7cbbdb
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-20T09:16:38Z

    [FLINK-7057][blob] remove the extra lock object from BlobCache
    
    We can lock on jobRefCounters instead, which is what we are guarding anyway.

commit 871a590b5c1b3897e26e516e027a86345e3a3f4f
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-20T09:17:21Z

    [FLINK-7057][blob] minor improvements to the TTL in BlobCache
    
    Do not use Long.MAX_VALUE as a code for "keep forever". Also add more comments.

commit c48575fe81367810f2fd5ae082170f349d3e7176
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-20T15:12:24Z

    [FLINK-7057][blob] replace "library-cache-manager.cleanup.interval" with "blob.service.cleanup.interval"
    
    Since we moved the cleanup to the BLOB service classes, this only makes sense.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4381: [FLINK-7196][blob] add a TTL to all transient BLOBs

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/4381
  
    rebased onto the latest #4359 


---

[GitHub] flink pull request #4381: [FLINK-7196][blob] add a TTL to all transient BLOB...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/4381


---