Posted to dev@iotdb.apache.org by Eric Pai <Er...@hotmail.com> on 2021/08/29 09:36:15 UTC

Request for review: [IOTDB-1564]: Make leader failure detection and election faster

Hi, Xiangdong and Xinyu,

The PR https://github.com/apache/iotdb/pull/3797 for JIRA https://issues.apache.org/jira/browse/IOTDB-1564 is ready for review.
Please take a look and share any suggestions on the code.

Thanks.

-----Original Message-----
From: Xiangdong Huang <sa...@gmail.com>
Sent: August 25, 2021 12:02
To: dev <de...@iotdb.apache.org>
Subject: Re: Re: Conclusion about JIRA issue[IOTDB-1564]: Make leader failure detection and election faster

Hi,

The current code is:

```
long electionWait =
    ClusterConstant.getElectionLeastTimeOutMs()
        + Math.abs(random.nextLong() % ClusterConstant.getElectionRandomTimeOutMs());
```

where the comment says that electionLeastTimeOutMs should be at least as long as a heartbeat.

IMO, these two parameters are enough, and we do not need to add more.

But the default values can be changed:
1. electionLeastTimeOutMs can default to heartbeat * 2 or something similar, rather than 2 seconds.
2. electionRandomTimeOutMs can default to 50 ms, or something like heartbeat / 10?
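
As a rough illustration of the defaults suggested above, here is a minimal sketch in which both values are derived from the heartbeat interval. The class and method names are hypothetical and are not part of ClusterConstant or the PR; the 1 ms floor on the random range is also an assumption, added only to keep the random bound positive for very small heartbeat intervals.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not IoTDB code: derives the election wait from the heartbeat interval.
final class ElectionWaitSketch {

    // Suggested default: at least two heartbeat intervals.
    static long leastTimeoutMs(long heartbeatIntervalMs) {
        return 2 * heartbeatIntervalMs;
    }

    // Suggested default: heartbeat / 10, floored at 1 ms (the floor is an assumption).
    static long randomRangeMs(long heartbeatIntervalMs) {
        return Math.max(1, heartbeatIntervalMs / 10);
    }

    // Returns a wait in [least, least + range).
    static long electionWaitMs(long heartbeatIntervalMs) {
        long least = leastTimeoutMs(heartbeatIntervalMs);
        long range = randomRangeMs(heartbeatIntervalMs);
        // nextLong(bound) yields a value in [0, bound), so no Math.abs is needed.
        return least + ThreadLocalRandom.current().nextLong(range);
    }

    public static void main(String[] args) {
        for (long hb : new long[] {100, 500, 1000}) {
            System.out.println("heartbeat=" + hb + " ms -> wait=" + electionWaitMs(hb) + " ms");
        }
    }
}
```

With a fixed 50 ms range instead of heartbeat / 10, randomRangeMs would simply return 50; which of the two is the better default is the open question here.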

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院

Eric Pai <er...@hotmail.com> wrote on Monday, August 23, 2021 at 10:18 AM:
>
> Hi, Xiangdong,
>
> So what are your suggestions for the election waiting time? Should we add another configuration parameter called election_wait_time_ms, or leave it as a shorter hardcoded constant?
>
> From: Eric Pai <Er...@hotmail.com>
> Date: Saturday, August 21, 2021, 7:32 PM
> To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> Subject: Re: Conclusion about JIRA issue[IOTDB-1564]: Make leader failure detection and election faster
>
> Hi, all,
>
> Currently the randomElectionWait time is hardcoded as 3-5 s, which is not suitable when heartbeat_interval_ms and election_timeout_ms are small.
>
> I have decided to change it to [2 * heartbeat_interval_ms, 2 * heartbeat_interval_ms + 50 ms).
>
> The 50 ms value is taken from the Raft paper: it gives a low probability of split votes and a fast election when they do happen.
>
> But I haven't found any detailed description of the relationship between heartbeat_interval_ms and the minimum waiting time.
>
> Any good suggestions?
>
> From: 白 渐
> Sent: August 18, 2021 22:14
> To: dev@iotdb.apache.org
> Subject: Conclusion about JIRA issue[IOTDB-1564]: Make leader failure detection and election faster
>
> Hi, all,
>
> @Xinyu Tan and I have reached a conclusion about refining the heartbeat- and election-related timeout parameters:
>
> JIRA link: https://issues.apache.org/jira/browse/IOTDB-1564
>
> Two parameters are added:
>
> heartbeat_interval_ms (t1): The time interval (ms) between two rounds of heartbeat broadcasts from a raft group leader.
>
> election_timeout_ms (t2 and t3): The election timeout of followers and candidates. It is used both as the time a follower waits before treating the heartbeat as expired (t2) and as the time a candidate waits for the voting result (t3).
>
> Leader view:   Send HB --(t1)--> Send HB --(t1)--> Send HB
> Follower view: Receive HB ------> Receive HB --(t2)--> HB expired / Start election --(t3)--> Election Timeout
>
> I will work on the following in due course:
>
> 1.     Coding.
>
> 2.     Proper test cases.
>
> 3.     Docs about new parameters.
>
> Thanks.
>
>
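
The follower-side timeline in the quoted message (t2 until the heartbeat expires, then t3 until the election times out) can be sketched roughly as below. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical and do not come from the PR, a single election_timeout_ms value is used for both t2 and t3 as the parameter description suggests, and what happens after an election timeout (for example a randomized wait before retrying) is not modeled.

```java
// Hypothetical sketch of the follower view: heartbeat expiry (t2) and election timeout (t3).
final class FollowerTimerSketch {

    enum State { FOLLOWER, CANDIDATE, ELECTION_TIMED_OUT }

    private final long electionTimeoutMs; // used for both t2 and t3
    private long lastHeartbeatMs;
    private long electionStartMs = -1;
    private State state = State.FOLLOWER;

    FollowerTimerSketch(long electionTimeoutMs, long nowMs) {
        this.electionTimeoutMs = electionTimeoutMs;
        this.lastHeartbeatMs = nowMs;
    }

    // A heartbeat from the leader resets the timer and any election in progress.
    void onHeartbeat(long nowMs) {
        lastHeartbeatMs = nowMs;
        electionStartMs = -1;
        state = State.FOLLOWER;
    }

    // Advances the state machine to the given time and returns the resulting state.
    State tick(long nowMs) {
        if (state == State.FOLLOWER && nowMs - lastHeartbeatMs >= electionTimeoutMs) {
            // t2 elapsed without a heartbeat: the heartbeat expired, start an election.
            state = State.CANDIDATE;
            electionStartMs = nowMs;
        } else if (state == State.CANDIDATE && nowMs - electionStartMs >= electionTimeoutMs) {
            // t3 elapsed without a voting result: the election timed out.
            state = State.ELECTION_TIMED_OUT;
        }
        return state;
    }
}
```

Time is passed in explicitly rather than read from the system clock, which keeps the sketch deterministic and easy to exercise in a unit test.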

Re: Request for review: [IOTDB-1564]: Make leader failure detection and election faster

Posted by Xiangdong Huang <sa...@gmail.com>.
Hi,

Just one question.
Is there any side effect of tuning heartbeat_interval from 1 second to 100
ms, e.g., on CPU utilization?
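
One rough way to frame the question is to count heartbeat RPCs: the number a leader sends per second grows inversely with the interval. The sketch below is only back-of-the-envelope arithmetic with hypothetical names; the group size of 3 is an assumption, and it says nothing about the actual CPU cost per heartbeat, which would have to be measured.

```java
// Hypothetical arithmetic sketch: heartbeat RPCs sent per second by one raft group leader.
final class HeartbeatLoadSketch {

    // The leader sends one heartbeat to each of the other (groupSize - 1) members per round.
    static double heartbeatsPerSecond(long heartbeatIntervalMs, int groupSize) {
        return (groupSize - 1) * (1000.0 / heartbeatIntervalMs);
    }

    public static void main(String[] args) {
        int groupSize = 3; // assumed group size, for illustration only
        System.out.println("1000 ms interval: " + heartbeatsPerSecond(1000, groupSize) + " RPCs/s");
        System.out.println(" 100 ms interval: " + heartbeatsPerSecond(100, groupSize) + " RPCs/s");
    }
}
```

Under these assumptions the RPC count per group rises tenfold, so the total depends mostly on how many raft groups a node leads.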

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院
