
Posted to issues@zookeeper.apache.org by "Jie Huang (Jira)" <ji...@apache.org> on 2020/05/10 18:14:00 UTC

[jira] [Updated] (ZOOKEEPER-3816) Improve the lagging detection between the leader and learners

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Huang updated ZOOKEEPER-3816:
---------------------------------
    Labels: pull-request-available  (was: )

> Improve the lagging detection between the leader and learners 
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3816
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Jie Huang
>            Assignee: Jie Huang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.6.2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we have SyncLimitCheck on the leader to detect a lagging learner by tracking how long a proposal takes to be acknowledged. If the leader doesn't receive the ack for a proposal from a learner within the syncLimit, it disconnects the learner.
> The purpose of the SyncLimitCheck is to prevent sessions connected to a slow learner from being expired.  By disconnecting the slow learner, it gives the clients a chance to re-connect to another server before session expiration. 
> However, there are two cases in which sessions can still expire with the current SyncLimitCheck implementation.
> One case is that the ack reaches the leader on time but a ping response including the session table is delayed. The lagging detection is based on the proposal/ack time yet the sessions are updated when the ping response is received. If the ping response is delayed longer than the ack, the sessions could expire without lagging being detected. It makes more sense to detect lagging based on ping/ping response time. 
> Another case is that the leader detects lagging and closes the connection to the slow learner, but the learner doesn't know that it has been disconnected because the socket takes a long time to close or the RST signal is lost. So the learner doesn't disconnect its clients, which lose their chance to re-connect to another server before session expiration. The learner, like the leader, also needs a means to detect communication issues at a higher-than-socket layer.
> So we need a lagging detector that is based on ping/ping response time and works in both directions between the leader and the learners.
>  
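
For illustration only, here is a minimal sketch of the kind of ping-based detector the description asks for. It is not the code from the pull request and not an existing ZooKeeper API: the class name PingLagDetector, its methods, the monotonically increasing ping ids, and the limitMillis parameter (standing in for a syncLimit-style timeout) are all assumptions made for this example. The idea is that each side records when it sends a ping, clears the record when the matching response arrives, and reports lagging once the oldest unanswered ping is older than the limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not ZooKeeper code: tracks unanswered pings and
// reports lagging when the oldest one has waited longer than limitMillis.
final class PingLagDetector {
    private final long limitMillis;
    // Each entry is {pingId, sendTimeMillis}; oldest unanswered ping first.
    private final Deque<long[]> outstanding = new ArrayDeque<>();

    PingLagDetector(long limitMillis) {
        this.limitMillis = limitMillis;
    }

    // Record that a ping with the given id was sent at nowMillis.
    // Assumes ids are assigned in increasing order by the caller.
    synchronized void onPingSent(long pingId, long nowMillis) {
        outstanding.addLast(new long[] {pingId, nowMillis});
    }

    // A response to pingId also covers every older ping still outstanding.
    synchronized void onPingResponse(long pingId) {
        while (!outstanding.isEmpty() && outstanding.peekFirst()[0] <= pingId) {
            outstanding.pollFirst();
        }
    }

    // True when the oldest unanswered ping has waited longer than the limit.
    synchronized boolean isLagging(long nowMillis) {
        long[] oldest = outstanding.peekFirst();
        return oldest != null && nowMillis - oldest[1] > limitMillis;
    }
}
```

Under these assumptions, the leader would keep one detector per learner and close the connection when isLagging returns true, and a learner would keep one detector for its leader connection and disconnect its clients in the same situation, which covers the second case above without relying on the socket layer to report the close.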



--
This message was sent by Atlassian Jira
(v8.3.4#803005)