Posted to issues@flink.apache.org by "Sihua Zhou (JIRA)" <ji...@apache.org> on 2018/03/01 03:56:00 UTC

[jira] [Commented] (FLINK-8753) Introduce savepoint that go though the incremental checkpoint path

    [ https://issues.apache.org/jira/browse/FLINK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381476#comment-16381476 ] 

Sihua Zhou commented on FLINK-8753:
-----------------------------------

Sorry for the interruption, but after looking at the code of {{JobMaster#rescaleOperators}}, which is used to support online rescaling, I find the distinction between {{checkpoint & savepoint}} a bit confusing now. {{JobMaster#rescaleOperators}} triggers a savepoint called {{lastInternalSavepoint}}. Its name suggests that it is not the kind of savepoint Aljoscha mentioned above (which aims to eventually be unified across backends). The {{lastInternalSavepoint}} is a savepoint whose only purpose is to rescale the job, which is the same thing this JIRA wants (but performance is a problem because it also goes through the full checkpoint path). So can I assume that what Flink wants for {{checkpoint & savepoint}} are 3 different things:

- checkpoint, which doesn't support rescaling and is only used for recovering from failures; it has the best performance.
- internalSavepoint, which supports rescaling but is not unified across backends; high performance, but lower than checkpoint (maybe like the {{archive checkpoint}} that Stephan mentioned above).
- savepoint, which supports rescaling and is unified across backends; performance lower than {{internalSavepoint}}.
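
To make the question concrete, the three categories above could be sketched roughly as follows. This is only an illustration of my understanding, not existing Flink code: {{SnapshotKinds}}, {{Kind}} and the two flags are hypothetical names, and the constants are simply listed in the order of expected performance described above (fastest first).

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotKinds {

    /** Hypothetical taxonomy of snapshot types; NOT a Flink API. */
    public enum Kind {
        // Listed from best to worst expected performance (an assumption from the list above).
        CHECKPOINT(false, false),
        INTERNAL_SAVEPOINT(true, false),
        SAVEPOINT(true, true);

        private final boolean supportsRescaling;
        private final boolean unifiedAcrossBackends;

        Kind(boolean supportsRescaling, boolean unifiedAcrossBackends) {
            this.supportsRescaling = supportsRescaling;
            this.unifiedAcrossBackends = unifiedAcrossBackends;
        }

        public boolean supportsRescaling() {
            return supportsRescaling;
        }

        public boolean unifiedAcrossBackends() {
            return unifiedAcrossBackends;
        }
    }

    /** Returns the kinds that could be used to rescale a job, fastest first. */
    public static List<Kind> usableForRescaling() {
        List<Kind> result = new ArrayList<>();
        for (Kind kind : Kind.values()) {
            if (kind.supportsRescaling()) {
                result.add(kind);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind
                    + " rescaling=" + kind.supportsRescaling()
                    + " unified=" + kind.unifiedAcrossBackends());
        }
    }
}
```

Under this reading, online rescaling would only need the first entry returned by {{usableForRescaling()}}, i.e. the internal savepoint.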

Sorry again for the interruption, but could you help me understand this? [~aljoscha] [~StephanEwen]

> Introduce savepoint that go though the incremental checkpoint path
> ------------------------------------------------------------------
>
>                 Key: FLINK-8753
>                 URL: https://issues.apache.org/jira/browse/FLINK-8753
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
>
> Right now, a savepoint goes through the full checkpoint path, so taking a savepoint can be slow. In our production environment, for some long-running jobs it often takes more than 10 minutes to complete a savepoint, which is unacceptable for a real-time job, so we currently have to fall back to using externalized checkpoints instead. But with externalized checkpoints there is a gap (up to the checkpoint interval) since the last checkpoint was taken. So I propose introducing an incremental savepoint that goes through the incremental checkpoint path.
> Any advice would be appreciated!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)