Posted to issues@flink.apache.org by "Yu Li (Jira)" <ji...@apache.org> on 2020/02/10 06:21:00 UTC

[jira] [Updated] (FLINK-14685) ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK

     [ https://issues.apache.org/jira/browse/FLINK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Li updated FLINK-14685:
--------------------------
    Fix Version/s:     (was: 1.10.0)
                   1.11.0
                   1.10.1

> ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14685
>                 URL: https://issues.apache.org/jira/browse/FLINK-14685
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zili Chen
>            Priority: Major
>             Fix For: 1.10.1, 1.11.0
>
>
> Currently, if {{ZooKeeperCheckpointIDCounter}} encounters the SUSPENDED state, i.e. a connection loss, it marks its state as invalid, so that all subsequent checkpoint ID counter operations fail.
> Because this is coupled with JM leadership management, a new ID counter is generated when leadership is re-granted, so it has not been a problem so far. The semantics are nevertheless wrong, because the ID counter should only check whether the current state is SUSPENDED/LOST.
> It is also a blocker for upgrading to Curator 4.2 and tolerating the SUSPENDED state in {{LeaderLatch}}. [~lamber-ken] provides a [fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299] there.
> Besides, in a production scenario we once noticed that the JM was not re-elected after a very fast SUSPENDED-RECONNECTED transition (this shouldn't happen now that [~trohrmann] has added linearized leader operations), so the JM kept running with a broken ID counter.
> I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue and fix these wrong semantics.
> CC [~GJL] [~azagrebin]
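
The difference between the two semantics described above can be sketched roughly as follows. This is only an illustration, not the actual Flink or Curator code: the ConnState enum is a hypothetical stand-in for the client's connection states, and the StickyCounter and CurrentStateCounter classes are made-up names. The counter value is kept in memory here, whereas the real counter lives in ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CheckpointIdCounterSketch {

    /** Hypothetical stand-in for the connection states reported by a ZooKeeper client. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Behavior reported in the issue: one SUSPENDED/LOST event breaks the counter for good. */
    static final class StickyCounter {
        private final AtomicLong next = new AtomicLong(1);
        private volatile boolean invalid = false;

        void stateChanged(ConnState newState) {
            if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
                invalid = true; // never reset, not even on RECONNECTED
            }
        }

        long getAndIncrement() {
            if (invalid) {
                throw new IllegalStateException("connection was lost at some point");
            }
            return next.getAndIncrement();
        }
    }

    /** Semantics proposed in the issue: only the current connection state matters. */
    static final class CurrentStateCounter {
        private final AtomicLong next = new AtomicLong(1);
        private volatile ConnState lastState = ConnState.CONNECTED;

        void stateChanged(ConnState newState) {
            lastState = newState; // remember only the most recent state
        }

        long getAndIncrement() {
            ConnState state = lastState;
            if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
                throw new IllegalStateException("connection is currently " + state);
            }
            return next.getAndIncrement();
        }
    }
}
```

Feeding both counters the sequence SUSPENDED, RECONNECTED shows the difference: StickyCounter keeps throwing after the reconnect, while CurrentStateCounter rejects operations only while the connection is suspended and resumes handing out IDs afterwards.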



--
This message was sent by Atlassian Jira
(v8.3.4#803005)