Posted to user@flink.apache.org by Daniel Harper <Da...@bbc.co.uk> on 2019/10/07 07:51:29 UTC

Difficult to debug reason for checkpoint decline

We had an issue recently where no checkpoints were able to complete, with the following message in the job manager logs:

2019-09-25 12:27:57,159 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Decline checkpoint 7041 by task 1f789ac3c5df655fe5482932b2255fd3 of job 214ccf9ab5edfb00f3bec3f454b57402.
2019-09-25 12:27:57,172 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding checkpoint 7041 of job 214ccf9ab5edfb00f3bec3f454b57402 because: Could not materialize checkpoint 7041 for operator uk.co.bbc.sawmill.streaming.pipeline.transformations.concurrentstreams.ConcurrentStreamsAggregator PERFORM COUNT DISTINCT OVER UUIDS FOR KEY -> ParDo(ToConcurrentStreamsResult)/ParMultiDo(ToConcurrentStreamsResult) -> JdbcIO.Write/ParDo(Write)/ParMultiDo(Write) (8/32).

This meant no checkpoints could ever complete until we restarted the job (we have the "don't fail on checkpoint failure" flag set, so the job itself kept running; see the sketch below).
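
For context, this is the flag we mean. A minimal sketch against the Flink 1.5-era CheckpointConfig API (the interval here is illustrative, not our actual value):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 60 seconds (illustrative interval).
            env.enableCheckpointing(60_000);

            // The "don't fail on checkpoint failure" flag: a declined or
            // failed checkpoint is logged but does not fail the job.
            env.getCheckpointConfig().setFailOnCheckpointingErrors(false);
        }
    }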

It’s difficult to debug why this happened, though: inspecting the task manager logs for the affected task shows no exceptions being reported during the affected times, and there is no stack trace in the job manager logs when the checkpoint gets declined/discarded.

I don’t know whether a stack trace would give more context or not, but I can see the log line being printed here: https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1255 (the equivalent line in 1.9.0: https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1347) – which doesn’t print the stack trace of the problem.
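
If I understand the logging correctly, whether a stack trace shows up comes down to whether the Throwable is passed to the logger as the final argument. A generic SLF4J sketch of the two variants (not the actual CheckpointCoordinator code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class DeclineLoggingSketch {
        private static final Logger LOG =
            LoggerFactory.getLogger(DeclineLoggingSketch.class);

        static void logDecline(String reason, Throwable cause) {
            // Variant 1: only the message is logged; the stack trace is lost.
            LOG.info("Discarding checkpoint because: {}", reason);

            // Variant 2: SLF4J treats a trailing Throwable specially and
            // prints the full stack trace after the message.
            LOG.info("Discarding checkpoint because: {}", reason, cause);
        }
    }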

Is there something else we can look at to try and determine what happened?

Note this is not a recurring issue.



Re: Difficult to debug reason for checkpoint decline

Posted by Chesnay Schepler <ch...@apache.org>.
There does indeed appear to be a code path in the StreamTask where an 
exception might not be logged on the TaskExecutor 
(StreamTask#handleExecutionException).
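
To illustrate the kind of path I mean, a hypothetical sketch (not the actual StreamTask code; Coordinator and declineCheckpoint are made-up names): the failure travels inside the decline message but is never handed to the local logger, so the TaskExecutor log stays silent.

    public class SwallowedExceptionSketch {

        interface Coordinator {
            void declineCheckpoint(long checkpointId, Throwable reason);
        }

        static void runCheckpoint(long checkpointId, Coordinator coordinator) {
            try {
                throw new RuntimeException("Could not materialize checkpoint");
            } catch (Exception e) {
                // No LOG.error(...) here: the exception only travels inside
                // the decline message sent to the JobManager.
                coordinator.declineCheckpoint(checkpointId, e);
            }
        }
    }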

In FLINK-10753 the CheckpointCoordinator was adjusted to log the full 
stack trace; that fix is part of 1.5.6.
