Posted to user@flink.apache.org by Steven Nelson <sn...@sourceallies.com> on 2019/09/26 19:33:09 UTC

Debugging slow/failing checkpoints

I am working with an application that hasn't gone to production yet. We run
Flink as a cluster within a K8s environment. It has the following attributes:
1) 2 Job Managers configured for HA, backed by ZooKeeper and HDFS
2) 4 Task Managers
3) Configured to use RocksDB. The actual RocksDB files are configured to be
written to a locally attached NVMe drive.
4) We checkpoint every 15 seconds, with a minimum pause of 7.5 seconds between checkpoints.
5) There is currently very little load going through the system (it's in a
test environment), and the web console indicates there isn't any back pressure.
6) The cluster is running Flink 1.9.0
7) I don't see anything unexpected in the logs
8) Checkpoints take longer than 10 minutes despite very little state (<1 MB),
and they fail due to timeout.
9) Eventually the job fails because it can't checkpoint.
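
For reference, a minimal sketch of how a setup like the one above is typically wired up in the job code (Flink 1.9 Java API; the HDFS checkpoint URI and the local NVMe path below are hypothetical placeholders, not taken from the actual application):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 15 seconds, with at least 7.5 seconds
        // of pause between consecutive checkpoints.
        env.enableCheckpointing(15_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(7_500);

        // The default checkpoint timeout is 10 minutes; raising it
        // can keep the job alive while the root cause is debugged.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // RocksDB working files live on the local NVMe drive, while
        // durable checkpoint data is written to HDFS.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("hdfs:///flink/checkpoints");
        backend.setDbStoragePath("/mnt/nvme/rocksdb");
        env.setStateBackend(backend);

        // ... job topology and env.execute(...) would follow here.
    }
}
```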

What steps beyond what I have already done should I consider to debug this?

-Steve

Re: Debugging slow/failing checkpoints

Posted by Congxian Qiu <qc...@gmail.com>.
Hi Steve,

1. Do you use exactly-once or at-least-once checkpointing?
2. Do you use incremental checkpoints or not?
3. Do you have any timers, and where are they stored (heap or RocksDB)?
You can refer to the config options here [1] and try storing the timers in RocksDB.
4. Is the alignment time too long?
5. Check whether the sync phase or the async phase of the checkpoint is
taking too long.
6. Check whether I/O or network usage during the checkpoint has reached its limit.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html#rocksdb-state-backend-config-options
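
The incremental-checkpoint and timer suggestions (points 2 and 3) can be enabled in flink-conf.yaml; a sketch, assuming the RocksDB state backend from the original setup, with key names taken from the Flink docs linked above:

```yaml
# Point 2: write checkpoint diffs instead of full snapshots.
state.backend.incremental: true

# Point 3: keep timer state in RocksDB instead of the JVM heap.
state.backend.rocksdb.timer-service.factory: ROCKSDB
```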
Best,
Congxian


Steven Nelson <sn...@sourceallies.com> wrote on Friday, September 27, 2019, at 3:33 AM:

>
> I am working with an application that hasn't gone to production yet. We
> run Flink as a cluster within a K8s environment. It has the following
> attributes
>
> 1) 2 Job Manager configured using HA, backed by Zookeeper and HDFS
> 2) 4 Task Managers
> 3) Configured to use RocksDB. The actual RocksDB files are configured to
> be written to a locally attached NVMe drive.
> 4) We checkpoint every 15 seconds, with a minimum delay of 7.5 seconds.
> 5) There is currently very little load going through the system (it's in a
> test environment). The web console indicates there isn't any Back Pressure
> 6) The cluster is running Flink 1.9.0
> 7) I don't see anything unexpected in the logs
> 8) Checkpoints take longer than 10 minutes with very little state (<1 mb),
> they fail due to timeout
> 9) Eventually the job fails because it can't checkpoint.
>
> What steps beyond what I have already done should I consider to debug this?
>
> -Steve
>