
Posted to notifications@iotdb.apache.org by "xinzhongtianxia (Jira)" <ji...@apache.org> on 2022/06/24 09:28:00 UTC

[jira] [Created] (IOTDB-3646) increase retry interval and times when confignode joining cluster

xinzhongtianxia created IOTDB-3646:
--------------------------------------

Summary: increase retry interval and times when confignode joining cluster
Key: IOTDB-3646
URL: https://issues.apache.org/jira/browse/IOTDB-3646
Project: Apache IoTDB
Issue Type: Improvement
Reporter: xinzhongtianxia

The current ConfigNode registration process is as follows:
1. Send a register request to the seed nodes, which return success immediately.
2. Open the consensus service port and wait for a callback from the seed nodes.
3. The seed nodes call back with consensus group information; if the callback fails, they retry.

This works well, but not in Kubernetes.

In Kubernetes, ConfigNodes communicate via FQDNs, which are resolved to the pod's IP by kube-dns or CoreDNS.
Unfortunately, a pod's DNS name cannot be resolved immediately after its consensus port starts, because there is a delay of several seconds (due to DNS caching) that depends on the configuration of the DNS service. For CoreDNS, for example, the default value is 30 seconds.
In most scenarios, we are only a tenant and have no permission to change the configuration of the DNS service.

The current maximum number of retries is 5, with an interval of 500 ms. This is not enough.
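
To make the gap concrete, here is a minimal sketch of the arithmetic. The class and method names are hypothetical and not taken from the IoTDB code base, and the 30-second DNS delay is assumed from the CoreDNS default mentioned above. It ignores the time each attempt itself takes.

```java
final class RetryWindow {
    /** Rough time covered by the retries, ignoring the duration of each call. */
    static long totalWindowMs(int maxRetries, long intervalMs) {
        return (long) maxRetries * intervalMs;
    }

    public static void main(String[] args) {
        long dnsDelayMs = 30_000; // assumption: DNS cache delay of 30 s
        long currentMs = totalWindowMs(5, 500); // current settings: 5 retries, 500 ms apart
        System.out.println("current window: " + currentMs + " ms");
        System.out.println("covers DNS delay: " + (currentMs >= dnsDelayMs));
    }
}
```

With the current settings the seed nodes give up after roughly 2.5 seconds, well before a 30-second DNS cache entry would expire.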

When all retries fail, the ConfigNode can never join the consensus group and keeps running without any ability to handle requests from DataNodes.

So we need to increase the number of retries and the interval, e.g. 15 retries with an interval of 2 s, to make the process more robust.
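
As an illustration only, a retry loop with a configurable attempt count and interval might look like the sketch below. The names `CallbackRetry` and `retry` are hypothetical, and the callback is simulated with a counter; this is not the actual IoTDB implementation.

```java
import java.util.function.BooleanSupplier;

final class CallbackRetry {
    /**
     * Runs the attempt up to maxRetries times, sleeping intervalMs after each failure
     * except the last. Returns the 1-based number of the attempt that succeeded,
     * or -1 if every attempt failed or the thread was interrupted.
     */
    static int retry(BooleanSupplier attempt, int maxRetries, long intervalMs) {
        for (int i = 1; i <= maxRetries; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
            if (i < maxRetries) {
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated callback: the first two calls fail, as if the FQDN were not yet resolvable.
        int[] calls = {0};
        BooleanSupplier callback = () -> ++calls[0] >= 3;
        // 15 attempts as proposed; the interval is shortened to 10 ms so the demo runs quickly.
        int succeededAt = retry(callback, 15, 10);
        System.out.println("succeeded at attempt " + succeededAt);
    }
}
```

With 15 attempts spaced 2 s apart, the retries span about 30 seconds, which matches the assumed CoreDNS cache default. Making both values configurable would let deployments with a longer DNS delay adjust them.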

--
This message was sent by Atlassian Jira
(v8.20.7#820007)