Posted to issues@flink.apache.org by "Congxian Qiu(klion26) (Jira)" <ji...@apache.org> on 2019/12/03 09:54:00 UTC
[jira] [Updated] (FLINK-13861) No new checkpoint will be trigged when canceling an expired checkpoint failed

     [ https://issues.apache.org/jira/browse/FLINK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Congxian Qiu(klion26) updated FLINK-13861:
------------------------------------------
    Fix Version/s:     (was: 1.10.0)

> No new checkpoint will be trigged when canceling an expired checkpoint failed
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-13861
>                 URL: https://issues.apache.org/jira/browse/FLINK-13861
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>
> I encountered this problem in our private fork of Flink. After looking at the current master branch of Apache Flink, I think the problem exists there as well.
> Problem Detail:
>  1. A checkpoint is canceled because it expired, which invokes the canceller shown below:
> {code:java}
> final Runnable canceller = () -> {
>    synchronized (lock) {
>       // only do the work if the checkpoint is not discarded anyways
>       // note that checkpoint completion discards the pending checkpoint object
>       if (!checkpoint.isDiscarded()) {
>          LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);
>          failPendingCheckpoint(checkpoint, CheckpointFailureReason.CHECKPOINT_EXPIRED);
>          pendingCheckpoints.remove(checkpointID);
>          rememberRecentCheckpointId(checkpointID);
>          triggerQueuedRequests();
>       }
>    }
> };{code}
>  
>  But {{failPendingCheckpoint}} may throw an exception, because it calls
> {{CheckpointCoordinator#failPendingCheckpoint}}
> -> {{PendingCheckpoint#abort}}
> -> {{PendingCheckpoint#reportFailedCheckpoint}}
> -> initializes a {{FailedCheckpointStats}}, which may throw an exception from {{checkArgument}}
> I have not yet found out why {{checkArgument}} failed (the problem does not reproduce frequently); I will create a separate issue for that if I have more findings.
>  
> 2. When the next checkpoint is triggered, we first check whether there are already too many pending checkpoints, as shown below:
> {code:java}
> private void checkConcurrentCheckpoints() throws CheckpointException {
>    if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
>       triggerRequestQueued = true;
>       if (currentPeriodicTrigger != null) {
>          currentPeriodicTrigger.cancel(false);
>          currentPeriodicTrigger = null;
>       }
>       throw new CheckpointException(CheckpointFailureReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
>    }
> }
> {code}
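> Because the exception skipped {{pendingCheckpoints.remove(checkpointID)}}, the expired checkpoint is still in {{pendingCheckpoints}}, so {{pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts}} will always be true.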
> 3. No checkpoint will ever be triggered from then on.
>  Because {{failPendingCheckpoint}} may throw an exception, we could move the logic that removes the pending checkpoint into a {{finally}} block.
> I'd like to file a PR for this if it really needs to be fixed.
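
The three steps above and the proposed fix can be sketched in isolation. This is a minimal simulation, not Flink code: the class {{ExpiredCheckpointCleanupSketch}} and its methods {{addPending}}, {{expireWithoutFinally}}, {{expireWithFinally}}, {{canTriggerNewCheckpoint}} and {{canTriggerAfterExpiry}} are hypothetical names, and the simulated {{failPendingCheckpoint}} always throws to stand in for the failed {{checkArgument}}.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the coordinator; none of these names are Flink's API.
class ExpiredCheckpointCleanupSketch {

    private final Object lock = new Object();
    private final Map<Long, String> pendingCheckpoints = new HashMap<>();
    private final int maxConcurrentCheckpointAttempts;

    ExpiredCheckpointCleanupSketch(int maxConcurrentCheckpointAttempts) {
        this.maxConcurrentCheckpointAttempts = maxConcurrentCheckpointAttempts;
    }

    void addPending(long checkpointId) {
        synchronized (lock) {
            pendingCheckpoints.put(checkpointId, "pending");
        }
    }

    // Assumption: always throws, to simulate the failed checkArgument in the report.
    private void failPendingCheckpoint(long checkpointId) {
        throw new IllegalArgumentException(
                "simulated failure while aborting checkpoint " + checkpointId);
    }

    // Shape of the canceller in the report: the remove is skipped when failing throws.
    void expireWithoutFinally(long checkpointId) {
        synchronized (lock) {
            failPendingCheckpoint(checkpointId);
            pendingCheckpoints.remove(checkpointId);
        }
    }

    // Proposed shape: the remove runs even when failing throws.
    void expireWithFinally(long checkpointId) {
        synchronized (lock) {
            try {
                failPendingCheckpoint(checkpointId);
            } finally {
                pendingCheckpoints.remove(checkpointId);
            }
        }
    }

    // Mirrors the size check in checkConcurrentCheckpoints.
    boolean canTriggerNewCheckpoint() {
        synchronized (lock) {
            return pendingCheckpoints.size() < maxConcurrentCheckpointAttempts;
        }
    }

    // Expires one checkpoint and reports whether a new one could be triggered afterwards.
    static boolean canTriggerAfterExpiry(boolean useFinally) {
        ExpiredCheckpointCleanupSketch coordinator = new ExpiredCheckpointCleanupSketch(1);
        coordinator.addPending(1L);
        try {
            if (useFinally) {
                coordinator.expireWithFinally(1L);
            } else {
                coordinator.expireWithoutFinally(1L);
            }
        } catch (IllegalArgumentException e) {
            // The exception still propagates out of the canceller in both variants;
            // it is swallowed here only so the sketch can inspect the state afterwards.
        }
        return coordinator.canTriggerNewCheckpoint();
    }

    public static void main(String[] args) {
        System.out.println("without finally: " + canTriggerAfterExpiry(false));
        System.out.println("with finally:    " + canTriggerAfterExpiry(true));
    }
}
```

With a limit of one concurrent checkpoint, the variant without {{finally}} leaves the stale entry in the map and blocks every later trigger, while the {{finally}} variant frees the slot even though the exception still propagates. In the real canceller, the other follow-up calls ({{rememberRecentCheckpointId}}, {{triggerQueuedRequests}}) would probably need the same treatment, but that is an assumption and not something the report confirms.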



--
This message was sent by Atlassian Jira
(v8.3.4#803005)