Posted to notifications@iotdb.apache.org by "Eric Pai (Jira)" <ji...@apache.org> on 2021/08/12 13:32:00 UTC

[jira] [Created] (IOTDB-1564) Make leader failure detection and election faster

Eric Pai created IOTDB-1564:
-------------------------------

             Summary: Make leader failure detection and election faster
                 Key: IOTDB-1564
                 URL: https://issues.apache.org/jira/browse/IOTDB-1564
             Project: Apache IoTDB
          Issue Type: Improvement
          Components: Cluster
            Reporter: Eric Pai
            Assignee: Eric Pai
             Fix For: master branch


The cluster configuration _connection_timeout_ms_ is currently used for several different purposes:

1. The connection and socket timeout of the underlying TSocket in the Thrift framework.
2. The CatchUpTask.
*3. The heartbeat expiration time of RaftMember.*
*4. The sleep time between adjacent FOLLOWER heartbeat timeout checks.*
*5. The election timeout.*

However, there is no reason for all of these timeouts to share one value. A longer heartbeat expiration time means slower detection of a leader failure, so we should split it out as a separate configuration parameter that DBAs can tune.

Apart from network latency, +4+ and +5+ are the main contributors to the time between a leader failure and the end of a successful election. We can optimize both. Here are my proposals:

a) Add new cluster configurations: _heartbeat_timeout_ms_ and _election_timeout_ms_, and leave _connection_timeout_ms_ as the TSocket timeout only (which also matches what its name says).
 * _heartbeat_timeout_ms_: The maximum time allowed to elapse since lastHeartbeatReceivedTime. If the current time exceeds lastHeartbeatReceivedTime + heartbeat_timeout_ms, a new election starts.
 * _election_timeout_ms_: The maximum time to wait for vote responses.
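
To make the split in a) concrete, here is a minimal Java sketch of the three separated timeouts and the heartbeat expiry rule. The class name ClusterTimeouts, its fields, and the method heartbeatExpired are hypothetical names chosen for illustration, not existing IoTDB code.

```java
/**
 * Hypothetical holder for the three separated timeouts.
 * The class, field, and method names are assumptions for illustration only.
 */
final class ClusterTimeouts {
  /** Used only as the TSocket connection/socket timeout. */
  final long connectionTimeoutMs;
  /** Maximum time allowed to elapse since the last heartbeat was received. */
  final long heartbeatTimeoutMs;
  /** Maximum time to wait for vote responses. */
  final long electionTimeoutMs;

  ClusterTimeouts(long connectionTimeoutMs, long heartbeatTimeoutMs, long electionTimeoutMs) {
    this.connectionTimeoutMs = connectionTimeoutMs;
    this.heartbeatTimeoutMs = heartbeatTimeoutMs;
    this.electionTimeoutMs = electionTimeoutMs;
  }

  /** True once nowMs exceeds lastHeartbeatReceivedTime + heartbeatTimeoutMs. */
  boolean heartbeatExpired(long lastHeartbeatReceivedTime, long nowMs) {
    return nowMs > lastHeartbeatReceivedTime + heartbeatTimeoutMs;
  }
}
```

With this shape, a short heartbeat timeout can be chosen for fast failure detection without shortening the socket timeout.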

b) We can also make the +4+ process smarter.

Because we already know _lastHeartbeatReceivedTime_, the expiration time of this heartbeat is _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_.

Thus we can sleep for _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_ - _now()_ before the next check. If the heartbeat has timed out by then, RaftMember starts an election immediately.
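
A small sketch of the sleep calculation in b), assuming all times are in milliseconds. The class and method names are hypothetical. The result is clamped at zero because Thread.sleep rejects negative values, which would otherwise occur when the heartbeat is already overdue.

```java
/** Hypothetical helper; the name and signature are assumptions for illustration. */
final class HeartbeatCheckDelay {
  private HeartbeatCheckDelay() {}

  /**
   * Time to sleep before the next heartbeat check:
   * lastHeartbeatReceivedTime + heartbeatTimeoutMs - nowMs, never below zero.
   */
  static long nextCheckDelayMs(long lastHeartbeatReceivedTime, long heartbeatTimeoutMs, long nowMs) {
    long remaining = lastHeartbeatReceivedTime + heartbeatTimeoutMs - nowMs;
    // A non-positive value means the heartbeat has already expired: check right away.
    return Math.max(0L, remaining);
  }
}
```

For example, with a heartbeat received at t = 10000, a timeout of 3000 and a current time of 11000, the follower sleeps 2000 ms and wakes up exactly when the heartbeat would expire, instead of sleeping a full fixed interval.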

c) If multiple RaftMembers start elections at the same time, all of the elections may fail because no candidate receives enough votes. Every member then waits a random time before the next election, which lengthens the overall election. To improve the election success rate, we can add a random delay to +b+:
- Sleep: _lastHeartbeatReceivedTime + heartbeat_timeout_ms - now() + randomTime()_

The smaller the value a member gets from _randomTime()_, the more likely it is to become the new leader.
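
The randomized sleep in c) could look roughly like the sketch below. The upper bound maxJitterMs is an assumption: this proposal does not say what range randomTime() should draw from. The class and method names are hypothetical as well.

```java
import java.util.Random;

/** Hypothetical helper; names and the jitter bound are assumptions for illustration. */
final class JitteredElectionDelay {
  private JitteredElectionDelay() {}

  /** Uniform random delay in [0, maxJitterMs); returns 0 when maxJitterMs is 0. */
  static long randomTime(Random random, long maxJitterMs) {
    return (long) (random.nextDouble() * maxJitterMs);
  }

  /**
   * lastHeartbeatReceivedTime + heartbeatTimeoutMs - nowMs + jitterMs,
   * clamped at zero so the result is always a valid sleep duration.
   */
  static long sleepMs(long lastHeartbeatReceivedTime, long heartbeatTimeoutMs, long nowMs, long jitterMs) {
    return Math.max(0L, lastHeartbeatReceivedTime + heartbeatTimeoutMs - nowMs + jitterMs);
  }
}
```

Because each follower draws its own jitter, the followers' timeouts are spread out, and the one with the smallest draw usually requests votes before the others have started their own elections.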

This design has another benefit. The leader node carries a heavier workload than the followers. In the future we can make the most idle node the leader during an election, by calculating resource usage and returning a lower value from the randomTime() method on idle nodes.
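
One possible shape for a load-aware randomTime() is sketched below. Everything in it is an assumption rather than part of this proposal: the load value in [0, 1], the maxJitterMs budget, and the choice to make half of that budget proportional to load while the other half stays random so that equally loaded nodes still differ.

```java
import java.util.Random;

/** Hypothetical sketch; the weighting scheme and all names are assumptions. */
final class LoadAwareRandomTime {
  private LoadAwareRandomTime() {}

  /**
   * Returns a delay in [0, maxJitterMs). Half of the budget grows with load
   * (0.0 = idle, 1.0 = fully loaded); the other half is uniformly random.
   */
  static long randomTime(double load, long maxJitterMs, Random random) {
    // Clamp the load so that out-of-range inputs cannot exceed the budget.
    double clamped = Math.min(1.0, Math.max(0.0, load));
    long loadPart = (long) (clamped * maxJitterMs / 2);
    long randomPart = (long) (random.nextDouble() * (maxJitterMs / 2.0));
    return loadPart + randomPart;
  }
}
```

Under this weighting a fully idle node always draws a smaller delay than a fully loaded one, so it times out first and is the most likely to win the election.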

Any suggestions or discussion are welcome :)


