You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Congxian Qiu(klion26) (Jira)" <ji...@apache.org> on 2019/08/26 13:26:00 UTC
[jira] [Created] (FLINK-13861) No new checkpoint trigged when
canceling an expired checkpoint failed
Congxian Qiu(klion26) created FLINK-13861:
---------------------------------------------
Summary: No new checkpoint trigged when canceling an expired checkpoint failed
Key: FLINK-13861
URL: https://issues.apache.org/jira/browse/FLINK-13861
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.9.0, 1.8.1, 1.7.2
Reporter: Congxian Qiu(klion26)
Fix For: 1.10.0
I encountered this problem in our private fork of Flink, after taking a look at the current master branch of Apache Flink, I think the problem exists here also.
Problem Detail:
1. checkpoint canceled because of expiration, so will call the canceller such as below
{code:java}
final Runnable canceller = () -> {
synchronized (lock) {
// only do the work if the checkpoint is not discarded anyways
// note that checkpoint completion discards the pending checkpoint object
if (!checkpoint.isDiscarded()) {
LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);
failPendingCheckpoint(checkpoint, CheckpointFailureReason.CHECKPOINT_EXPIRED);
pendingCheckpoints.remove(checkpointID);
rememberRecentCheckpointId(checkpointID);
triggerQueuedRequests();
}
}
};{code}
But failPendingCheckpoint may throw exceptions because it will call
{{CheckpointCoordinator#failPendingCheckpoint}}
-> {{PendingCheckpoint#abort}}
-> {{PendingCheckpoint#reportFailedCheckpoint}}
-> initialize a FailedCheckpointStates, may throw an exception by {{checkArgument}}
Did not find more about why there ever failed the {{checkArgument currently(this problem did not reproduce frequently)}}, will create an issue for that if I have more findings.
2. when trigger checkpoint next, we'll first check if there already are too many checkpoints such as below
{code:java}
private void checkConcurrentCheckpoints() throws CheckpointException {
if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
triggerRequestQueued = true;
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
throw new CheckpointException(CheckpointFailureReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
}
}
{code}
the {{pendingCheckpoints.zie() >= maxConcurrentCheckpoitnAttempts}} will always true
3. no checkpoint will be triggered ever from that on.
Because of the {{failPendingCheckpoint}} may throw Exception, so we may place the remove pending checkpoint logic in a finally block.
I'd like to file a pr for this if this really needs to fix.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)