You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by tillrohrmann <gi...@git.apache.org> on 2017/10/17 09:05:15 UTC

[GitHub] flink pull request #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending ch...

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/4844

    [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoints for fine grained recovery

    ## What is the purpose of the change
    
    This commit will fail all pending checkpoints which have not been acknowledged by
    the failed task in case of fine grained recovery. This is done in order to avoid
    long checkpoint timeouts which might block the CheckpointCoordinator from triggering
    new checkpoints.
    
    ## Brief change log
    
    - Introduce `CheckpointCoordinator#failUnacknowledgedPendingCheckpointsFor` to fail all unacknowledged pending checkpoints for a given `ExecutionAttemptID`
    - Fail unacknowledged pending checkpoints in `ExecutionGraph#notifyExecutionChange`
    
    ## Verifying this change
    
    - `IndividualRestartsConcurrencyTest#testLocalFailureFailsPendingCheckpoints` tests that unacknowledged pending checkpoints are discarded and removed from the `CheckpointCoordinator` in case of a local failure
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink failCheckpointsFineGrainedRecovery

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4844.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4844
    
----
commit 454066b2f606f83e159e553506d58ce3a49a256d
Author: Till <ti...@gmail.com>
Date:   2017-10-17T08:57:37Z

    [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoints for fine grained recovery
    
    This commit will fail all pending checkpoints which have not been acknowledged by
    the failed task in case of fine grained recovery. This is done in order to avoid
    long checkpoint timeouts which might block the CheckpointCoordinator from triggering
    new checkpoints

----


---

[GitHub] flink pull request #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending ch...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/4844


---

[GitHub] flink pull request #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending ch...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4844#discussion_r147162652
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -1270,6 +1272,42 @@ public void run() {
     	}
     
     	/**
    +	 * Discards the given pending checkpoint because of the given cause.
    +	 *
    +	 * @param pendingCheckpoint to discard
    +	 * @param cause for discarding the checkpoint
    +	 */
    +	private void discardCheckpoint(PendingCheckpoint pendingCheckpoint, @Nullable Throwable cause) {
    +		Thread.holdsLock(lock);
    --- End diff --
    
    Should that be an `assert(Thread.holdsLock(lock));` or a `Preconditions.checkState(Thread.holdsLock(lock));`?


---

[GitHub] flink pull request #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending ch...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4844#discussion_r147415027
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -1270,6 +1272,42 @@ public void run() {
     	}
     
     	/**
    +	 * Discards the given pending checkpoint because of the given cause.
    +	 *
    +	 * @param pendingCheckpoint to discard
    +	 * @param cause for discarding the checkpoint
    +	 */
    +	private void discardCheckpoint(PendingCheckpoint pendingCheckpoint, @Nullable Throwable cause) {
    +		Thread.holdsLock(lock);
    --- End diff --
    
    Yes definitely.


---

[GitHub] flink issue #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoin...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/4844
  
    @zhenzhongxu what kind of metrics would you be interested in? I think we should open a dedicated issue for that.


---

[GitHub] flink issue #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoin...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4844
  
    Had an offline discussion with @tillrohrmann - rewriting this without Mockito results in a similar amount of code with similar maintenance effort, so seems to be okay in this case.
    
    +1 to merge after fixing the `Thread.holdsLock(lock)` comment above


---

[GitHub] flink issue #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoin...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/4844
  
    Thanks for the review @StephanEwen. I will address your comments and then merge this PR.


---

[GitHub] flink issue #4844: [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoin...

Posted by zhenzhongxu <gi...@git.apache.org>.
Github user zhenzhongxu commented on the issue:

    https://github.com/apache/flink/pull/4844
  
    @tillrohrmann created https://issues.apache.org/jira/browse/FLINK-7894


---