Posted to issues@flink.apache.org by "lamber-ken (JIRA)" <ji...@apache.org> on 2019/07/10 09:43:00 UTC
[jira] [Updated] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

     [ https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lamber-ken updated FLINK-13189:
-------------------------------
    Description: 
*Issue detail info*

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use ZooKeeper as the HighAvailabilityService, but we found that a Flink job restarts whenever the network between the JobManager and ZooKeeper is temporarily disconnected.

So we analyzed this problem in depth. The Flink JobManager uses Curator's `LeaderLatch` to maintain leadership. When the network disconnects, `LeaderLatch` sets leadership to false immediately, without waiting to see whether the connection recovers. We think it is too drastic for many long-running Flink jobs to restart because of a brief network glitch.

 

*Fix this issue*

According to the official Curator website, this issue was fixed in Curator 3.x.x, but we cannot simply change Flink's Curator version (2.12.0) to 3.x.x because of ZooKeeper compatibility: Curator 2.x.x supports ZooKeeper 3.4.x and ZooKeeper 3.5.0, while Curator 3.x.x is compatible only with ZooKeeper 3.5.x. Based on these considerations, we updated `LeaderLatch` in the flink-shaded-curator module.
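
The difference between the two behaviours can be sketched with a small, self-contained model. This is only an illustration under stated assumptions: it is not Curator's code and not the actual patch, and the names `LeaderLatchModel`, `ErrorPolicy`, `STANDARD` and `SESSION` are hypothetical. It also assumes that the node wins the election again after a reconnect, which a real latch does not guarantee.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified, hypothetical model of a leader latch reacting to connection events. */
class LeaderLatchModel {

    /** Connection events, named after the states a ZooKeeper client typically reports. */
    enum ConnectionState { SUSPENDED, RECONNECTED, LOST }

    /** Decides which connection states are errors that must revoke leadership. */
    enum ErrorPolicy {
        // Behaviour described in this issue: any disconnect revokes leadership.
        STANDARD(EnumSet.of(ConnectionState.SUSPENDED, ConnectionState.LOST)),
        // Tolerant behaviour: only a lost (expired) session revokes leadership.
        SESSION(EnumSet.of(ConnectionState.LOST));

        private final Set<ConnectionState> errorStates;

        ErrorPolicy(Set<ConnectionState> errorStates) {
            this.errorStates = errorStates;
        }

        boolean isErrorState(ConnectionState state) {
            return errorStates.contains(state);
        }
    }

    private final ErrorPolicy policy;
    private boolean leader = true;   // assumption: this node currently holds leadership
    private int revocations = 0;     // each revocation stands for one job restart

    LeaderLatchModel(ErrorPolicy policy) {
        this.policy = policy;
    }

    /** Applies one connection event to the latch. */
    void stateChanged(ConnectionState newState) {
        if (policy.isErrorState(newState)) {
            if (leader) {
                leader = false;
                revocations++;
            }
        } else if (newState == ConnectionState.RECONNECTED && !leader) {
            // Assumption: the node re-enters the election and wins it again.
            leader = true;
        }
    }

    boolean hasLeadership() {
        return leader;
    }

    int revocations() {
        return revocations;
    }

    public static void main(String[] args) {
        // A short network glitch: the connection is suspended, then restored.
        ConnectionState[] glitch = { ConnectionState.SUSPENDED, ConnectionState.RECONNECTED };
        for (ErrorPolicy policy : ErrorPolicy.values()) {
            LeaderLatchModel latch = new LeaderLatchModel(policy);
            for (ConnectionState state : glitch) {
                latch.stateChanged(state);
            }
            System.out.println(policy + ": leadership revoked " + latch.revocations() + " time(s)");
        }
    }
}
```

In this model the same glitch revokes leadership once under `STANDARD` and never under `SESSION`. A lost session still revokes leadership under both policies, so the tolerant policy is not meant to weaken the guarantee against two leaders.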

 

*Other*

Any suggestions are welcome, thanks.

 

*Useful links*

[https://curator.apache.org/zk-compatibility.html]
[https://cwiki.apache.org/confluence/display/CURATOR/Releases]
[http://curator.apache.org/curator-recipes/leader-latch.html]

  



> Fix the impact of zookeeper network disconnect temporarily on flink long running jobs
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-13189
>                 URL: https://issues.apache.org/jira/browse/FLINK-13189
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.8.1
>            Reporter: lamber-ken
>            Assignee: lamber-ken
>            Priority: Major
>             Fix For: 1.9.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)