Posted to user@flink.apache.org by Mu Kong <ko...@gmail.com> on 2019/05/09 06:44:16 UTC

Checkpoint expired before completing with cleanupInRocksdbCompactFilter

Hi community,

I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter
to support state cleanup for the RocksDB backend.
We have an application that relies heavily on managed keyed state.
Because we use RocksDB as the state backend, we were suffering from
ever-growing state size. To be more specific, our checkpoint size grew
to 200GB in 2 weeks.

After upgrading to 1.8.0 and enabling the cleanupInRocksdbCompactFilter TTL
config, the checkpoint size never grows beyond 10GB.
However, two days after the upgrade, checkpointing started to fail with
"Checkpoint expired before completing".

I could not find anything useful in the logs.
But in the Flink UI, the last successful checkpoint took 1m to finish, and
our checkpoint timeout is set to 15m.
It seems that the checkpoint duration became extremely long all of a sudden.

Is there any way I can look into this further? Or is there any
direction in which I can tune the TTL for the application?

Thanks in advance!

Best regards,
Mu

Re: Checkpoint expired before completing with cleanupInRocksdbCompactFilter

Posted by Congxian Qiu <qc...@gmail.com>.
Hi, Mu
Is there anything like `Received late message for now expired checkpoint attempt ${checkpointID} from ${taskExecutionID} of job ${jobID}` in the JM log?

If yes, that means the task took too long to complete the checkpoint (maybe it received the barrier too late, or maybe it spent too much time taking the checkpoint; you can investigate further in the TM log).
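
To see which tasks acknowledge late most often, the JM log can be scanned for that message. The following is a minimal sketch, not an official tool: the regular expression is an assumption based only on the message template quoted above, and the helper name `late_acks_by_task` is made up for illustration. Adjust the pattern if the wording in your Flink version differs.

```python
import re
import sys
from collections import Counter

# Assumed shape of the JobManager warning, derived from the template quoted
# above. Whitespace is matched loosely in case the log wraps or pads it.
LATE_ACK = re.compile(
    r"Received\s+late\s+message\s+for\s+now\s+expired\s+checkpoint\s+attempt\s+"
    r"(?P<checkpoint>\d+)\s+from\s+(?P<task>\S+)\s+of\s+job\s+(?P<job>[^\s.]+)"
)


def late_acks_by_task(lines):
    """Count late checkpoint acknowledgements per task execution ID."""
    counts = Counter()
    for line in lines:
        match = LATE_ACK.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python late_acks.py /path/to/jobmanager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for task, n in late_acks_by_task(log).most_common():
            print(f"{n:6d}  {task}")
```

The task execution IDs that show up most often are the ones whose TM logs are worth reading first.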


Best
Congxian