Posted to issues@flink.apache.org by "Sihua Zhou (JIRA)" <ji...@apache.org> on 2018/06/26 03:40:00 UTC

[jira] [Created] (FLINK-9661) TTL state should support to do time shift after restoring from checkpoint( savepoint).

Sihua Zhou created FLINK-9661:
---------------------------------

             Summary: TTL state should support to do time shift after restoring from checkpoint( savepoint).
                 Key: FLINK-9661
                 URL: https://issues.apache.org/jira/browse/FLINK-9661
             Project: Flink
          Issue Type: Improvement
          Components: State Backends, Checkpointing
    Affects Versions: 1.6.0
            Reporter: Sihua Zhou


The initial version of TTL state appends the expiration timestamp to the state record and, when the state is accessed, checks it with the condition {{expired_timestamp <= current_time}}: if the condition is true the record is expired, otherwise it is still alive. This works fine in most cases, but in some cases we need to do a time shift; otherwise the check may produce unexpected results when using ProcessingTime. I roughly describe two such cases below.
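
A minimal sketch of the check described above, for illustration only. The names {{TtlExpiryCheck}}, {{TtlValue}}, {{wrap}} and {{isExpired}} are hypothetical and are not taken from the Flink code base; timestamps are assumed to be processing-time milliseconds.

```java
/** Illustrative only: a state value stored together with its expiration timestamp. */
final class TtlExpiryCheck {

    /** Hypothetical wrapper holding the user value and the time at which it expires (ms). */
    static final class TtlValue<T> {
        final T value;
        final long expirationTimestamp;

        TtlValue(T value, long expirationTimestamp) {
            this.value = value;
            this.expirationTimestamp = expirationTimestamp;
        }
    }

    /** On write: the expiration timestamp is the write time plus the configured TTL. */
    static <T> TtlValue<T> wrap(T value, long currentTime, long ttlMillis) {
        return new TtlValue<>(value, currentTime + ttlMillis);
    }

    /** On access: the record is expired when expired_timestamp <= current_time. */
    static boolean isExpired(TtlValue<?> record, long currentTime) {
        return record.expirationTimestamp <= currentTime;
    }
}
```

Because the stored timestamp is an absolute wall-clock value, any time that passes while the job is not running still counts against the TTL.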

- When restoring the job from a savepoint

For example, the user sets the TTL of the state to 2h. If they trigger a savepoint and restore the job from it more than 2h later (perhaps something prevented them from restoring the job quickly), then all of the restored job's previous state data is already expired.

- When the job spends a long time recovering from a failure

For example, there are many jobs running on a YARN session cluster, and the cluster is configured to store the checkpoint data on the DFS. Unfortunately, the DFS runs into a strange problem that makes the jobs on the cluster loop in recovery-fail-recovery-fail... The devs spend some time fixing the DFS issue and the jobs start working properly again, but if "{{system down time >= TTL}}" then the jobs' previous state data will be expired.

To avoid the problems above, we need to do a time shift after the job recovers from a checkpoint or savepoint. A possible approach is outlined in [6186|https://github.com/apache/flink/pull/6186].
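
One way to picture such a time shift is sketched below. This is an assumption for illustration and not necessarily what the linked pull request does: it supposes that the timestamp of the checkpoint or savepoint is recorded, and that on restore every expiration timestamp is moved forward by the time the job was down. The names {{TtlTimeShift}}, {{downtime}} and {{shiftExpiration}} are hypothetical.

```java
/** Illustrative only: shifting expiration timestamps by the job's downtime on restore. */
final class TtlTimeShift {

    /** Time the job was not running, assumed to be restore time minus snapshot time (ms). */
    static long downtime(long snapshotTime, long restoreTime) {
        return Math.max(0L, restoreTime - snapshotTime);
    }

    /** Moves an expiration timestamp forward by the downtime, saturating at Long.MAX_VALUE. */
    static long shiftExpiration(long expirationTimestamp, long snapshotTime, long restoreTime) {
        long shift = downtime(snapshotTime, restoreTime);
        if (expirationTimestamp > Long.MAX_VALUE - shift) {
            return Long.MAX_VALUE; // avoid overflow for "never expires" style timestamps
        }
        return expirationTimestamp + shift;
    }

    /** Same condition as in the issue: expired_timestamp <= current_time. */
    static boolean isExpired(long expirationTimestamp, long currentTime) {
        return expirationTimestamp <= currentTime;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long expiration = 2 * hour;   // record written at t = 0 with a TTL of 2h
        long savepointTime = hour / 2; // savepoint taken at t = 30min
        long restoreTime = 3 * hour;   // job restored at t = 3h

        // Without a shift the record is already expired at restore time.
        System.out.println(isExpired(expiration, restoreTime)); // true

        // With a shift the record keeps the 1.5h of lifetime it had at the savepoint.
        long shifted = shiftExpiration(expiration, savepointTime, restoreTime);
        System.out.println(isExpired(shifted, restoreTime)); // false
    }
}
```

In this sketch the remaining lifetime of a record at restore time equals its remaining lifetime when the snapshot was taken, so downtime no longer counts against the TTL.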


