You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by ramkrish86 <gi...@git.apache.org> on 2017/02/16 11:32:53 UTC

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

GitHub user ramkrish86 opened a pull request:

    https://github.com/apache/flink/pull/3334

    FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

    unsuccessful checkpoints
    
    Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
    If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
    In addition to going through the list, please provide a meaningful description of your changes.
    
    - [ ] General
      - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
      - The pull request addresses only one issue
      - Each commit in the PR has a meaningful commit message (including the JIRA id)
    
    - [ ] Documentation
      - Documentation has been added for new functionality
      - Old documentation affected by the pull request has been updated
      - JavaDoc for public methods has been added
    
    - [ ] Tests & Build
      - Functionality added by the pull request is covered by tests
      - `mvn clean verify` has been executed successfully locally or a Travis build has passed
    
    
    Ran mvn clean verify. Did not add test cases to know the first level feedback. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ramkrish86/flink FLINK-4810

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3334.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3334
    
----
commit 6e0fb38272e6bb59528065461c6ec6fdd43689ad
Author: Ramkrishna <ra...@intel.com>
Date:   2017-02-16T11:29:37Z

    FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n"
    unsuccessful checkpoints

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @StephanEwen 
    No problem. I appreciate your time and efforts. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by shixiaogang <gi...@git.apache.org>.
Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103604470
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -121,6 +121,8 @@
     
     	/** The maximum number of checkpoints that may be in progress at the same time */
     	private final int maxConcurrentCheckpointAttempts;
    +	/** The maximum number of unsuccessful checkpoints */
    +	private final int maxUnsuccessfulCheckpoints;
    --- End diff --
    
    I think `failed` is a better word than `unsuccessful`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by eliaslevy <gi...@git.apache.org>.
Github user eliaslevy commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Any chance this will be merged now that 1.5 is out?


---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612320
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -121,6 +121,8 @@
     
     	/** The maximum number of checkpoints that may be in progress at the same time */
     	private final int maxConcurrentCheckpointAttempts;
    +	/** The maximum number of unsuccessful checkpoints */
    +	private final int maxUnsuccessfulCheckpoints;
    --- End diff --
    
    ok.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by shixiaogang <gi...@git.apache.org>.
Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103605271
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
     				if (!checkpoint.isDiscarded()) {
     					checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
     				}
    +				if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    +					return failExecution(executions);
    +				}
     				return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
     			}
     
     		} // end trigger lock
     	}
     
    +	private CheckpointTriggerResult failExecution(Execution[] executions) {
    +		if (currentPeriodicTrigger != null) {
    +			currentPeriodicTrigger.cancel();
    +			currentPeriodicTrigger = null;
    +		}
    +		for (Execution execution : executions) {
    +			// fail the graph
    +			execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
    --- End diff --
    
    I think it's not good here to fail the executions one by one. We should call `ExecutionGraph#fail` to fail the execution graph.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @wenlong88 
    Can you tell more when you say checkpointing failure and trigger failure? I think if you are saying about tracking the number of times the execution fails after restoring from a checkpoint I think FLINK-4815 is trying to focus that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by shixiaogang <gi...@git.apache.org>.
Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612613
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
     			catch (Throwable t) {
     				int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
     				LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
    +				if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    --- End diff --
    
    You are right. I missed it. Sorry for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @ramkrish86 I would like to get to this one here after the additions to the checkpoint coordinator I am currently working on are done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @StephanEwen - Ping for initial reviews. Will work on it based on the feedback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by shixiaogang <gi...@git.apache.org>.
Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103605788
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
     			catch (Throwable t) {
     				int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
     				LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
    +				if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    --- End diff --
    
    Here the counter records the total number of failed attempts. Since a streaming job is intended to run a quite long time, the number of failed attempts will eventually exceed the limit. We should use a different counter here which is reset once a pending checkpoint successfully completes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    I think I got what you are saying here. Since Execution#triggerCheckpoint is the actual checkpoint call and currently we don't track it if there is a failure. So your point is it is better know if there was a failure in actual checkpoint triggering at the Task level and then count that as a failure. Am I right @wenlong88 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @StephanEwen 
    I saw in another JIRA one of your comment where you talked about refactoring CheckPointcoordinator and Pendingcheckpoint. So you woud this PR to wait till then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Thank you for opening this pull request.
    I'll try to review it in the coming days...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by wenlong88 <gi...@git.apache.org>.
Github user wenlong88 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Currently the `numUnsuccessfulCheckpointsTriggers` will be reset after a successful trigger instead of a successful checkpoint. But I think it is rare actually for triggering failure and monitoring checkpoint failure is more valuable. What do you guys think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103638771
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
     				if (!checkpoint.isDiscarded()) {
     					checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
     				}
    +				if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    +					return failExecution(executions);
    +				}
     				return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
     			}
     
     		} // end trigger lock
     	}
     
    +	private CheckpointTriggerResult failExecution(Execution[] executions) {
    +		if (currentPeriodicTrigger != null) {
    +			currentPeriodicTrigger.cancel();
    +			currentPeriodicTrigger = null;
    +		}
    +		for (Execution execution : executions) {
    +			// fail the graph
    +			execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
    --- End diff --
    
    I verified the code once again. There is no reference to ExecutionGraph in Checkpointcoordinator and also calling fail on the current Execution actually triggers the restart flow to happen.
    Execution#fail()->Marks state to FAILED->vertex#executionFailed()->graph#jobVertexInFinalState(). So you think this way of failing won't work?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3334: FLINK-4810 Checkpoint Coordinator should fail Exec...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612421
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
     				if (!checkpoint.isDiscarded()) {
     					checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
     				}
    +				if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    +					return failExecution(executions);
    +				}
     				return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
     			}
     
     		} // end trigger lock
     	}
     
    +	private CheckpointTriggerResult failExecution(Execution[] executions) {
    +		if (currentPeriodicTrigger != null) {
    +			currentPeriodicTrigger.cancel();
    +			currentPeriodicTrigger = null;
    +		}
    +		for (Execution execution : executions) {
    +			// fail the graph
    +			execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
    --- End diff --
    
    Ok sure. I will add tests for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    @StephanEwen , @wenlong88 , @shixiaogang 
    Pls have a look at the latest push. Now I am tracking the failures in the checkpointing and incrementing  a new counter based on it. Added test cases also. 
    I have not changed the constructors of the affected class because it touches many files. I can update it based on the feedback of the latest PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    I thinkI got a better way to trck this. Will update the PR sooner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Just updated and did a force push to avoid the merge commit. Now things are fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Thanks for the input. I read the code. There are two ways a checkpoint fails (as per my code understanding). If for some reason checkpointing cannot be performed we send DeclineCheckpoint message. That is handled by the Checkpointcoordinator.
    Another is if there is an external error in checkpointing and in that case we call failExternally. Which transitions the state to FAILED and closes all the watchdog, and cancels the invokable also. Now is the intent to track how many times this happens and if so track such occurences of failure and then fail the execution graph?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...

Posted by ramkrish86 <gi...@git.apache.org>.
Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334
  
    Ping for reviews here!!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---