Posted to notifications@iotdb.apache.org by "Tian Jiang (Jira)" <ji...@apache.org> on 2020/10/22 02:54:00 UTC

[jira] [Created] (IOTDB-953) [Distributed] Improve handling when a node cannot be reached

Tian Jiang created IOTDB-953:
--------------------------------

Summary: [Distributed] Improve handling when a node cannot be reached
Key: IOTDB-953
URL: https://issues.apache.org/jira/browse/IOTDB-953
Project: Apache IoTDB
Issue Type: Improvement
Components: Core/Cluster
Reporter: Tian Jiang
Assignee: Houliang Qi

When a node fails to send a request to another node, it records one failure in its ClientPool. Once the failure count for that node reaches 3, the pool refuses to hand out clients for that node for 60s.
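
As a rough illustration of the mechanism described above, a minimal sketch might look like the following. This is not the actual ClientPool code: the class name `FailureGate`, its method names, and the injectable clock are all hypothetical, and whether the real pool resets its counter when the block starts is an assumption made here for simplicity. Only the threshold of 3 and the 60s block come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical model of the failure counting described above; not IoTDB's ClientPool. */
class FailureGate {
  static final int MAX_FAILURES = 3;         // threshold from the description
  static final long BLOCK_MILLIS = 60_000L;  // 60s block from the description

  private final Map<String, Integer> failures = new ConcurrentHashMap<>();
  private final Map<String, Long> blockedUntil = new ConcurrentHashMap<>();
  private final LongSupplier clock;          // injectable time source (assumption, eases testing)

  FailureGate(LongSupplier clock) {
    this.clock = clock;
  }

  /** Every error counts as a failure, whatever its cause. */
  void onError(String node) {
    int count = failures.merge(node, 1, Integer::sum);
    if (count >= MAX_FAILURES) {
      blockedUntil.put(node, clock.getAsLong() + BLOCK_MILLIS);
      failures.remove(node);                 // assumed: counter restarts once the block begins
    }
  }

  /** Returns false while the node is inside its block window. */
  boolean canGetClient(String node) {
    Long until = blockedUntil.get(node);
    if (until == null) {
      return true;
    }
    if (clock.getAsLong() >= until) {
      blockedUntil.remove(node);             // block has expired
      return true;
    }
    return false;
  }
}
```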

This implementation has three main drawbacks:
1. It does not distinguish network connection errors from other errors. Whenever `onError()` of a client is called, the failure count increases, even if the call was not caused by a network failure.
2. Heartbeats should not be affected by this mechanism. One purpose of heartbeats is to detect whether a node is still alive, and they need clients to do so. If the mechanism blocks them, we lose the chance to reconnect to the node earlier and must wait the full 60s even if the node has already recovered.
3. Heartbeat successes do not unblock other requests. Because heartbeats use a separate pool, a successful heartbeat to a node only unblocks other heartbeats to that node; other requests remain blocked for 60s because they use a different client pool.
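
One possible way to address the three points above is to keep the per-node failure state in a single object shared by both pools. The sketch below is only an illustration under stated assumptions, not a proposed patch: `NodeHealth` and all of its methods are hypothetical names, and treating any `IOException` in the cause chain as a "network error" is a stand-in for whatever classification the cluster module would actually use.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-node state shared by the heartbeat pool and the regular client pool. */
class NodeHealth {
  static final int MAX_FAILURES = 3;         // threshold from the issue description
  static final long BLOCK_MILLIS = 60_000L;  // 60s block from the issue description

  private final Map<String, Integer> failures = new ConcurrentHashMap<>();
  private final Map<String, Long> blockedUntil = new ConcurrentHashMap<>();
  private final LongSupplier clock;          // injectable time source (assumption)

  NodeHealth(LongSupplier clock) {
    this.clock = clock;
  }

  /** Point 1: only errors that look like network failures are counted. */
  void onError(String node, Throwable error) {
    if (!isNetworkError(error)) {
      return;
    }
    int count = failures.merge(node, 1, Integer::sum);
    if (count >= MAX_FAILURES) {
      blockedUntil.put(node, clock.getAsLong() + BLOCK_MILLIS);
    }
  }

  /** Point 3: any success, including a heartbeat, unblocks the node for all requests. */
  void onSuccess(String node) {
    failures.remove(node);
    blockedUntil.remove(node);
  }

  /** Point 2: heartbeats are never refused a client. */
  boolean canGetClient(String node, boolean isHeartbeat) {
    if (isHeartbeat) {
      return true;
    }
    Long until = blockedUntil.get(node);
    if (until == null) {
      return true;
    }
    if (clock.getAsLong() < until) {
      return false;
    }
    blockedUntil.remove(node);               // block expired: start from a clean state
    failures.remove(node);
    return true;
  }

  /** Assumed classification: an IOException anywhere in the cause chain. */
  static boolean isNetworkError(Throwable error) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (t instanceof IOException) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch, both pools would call `onError()` and `onSuccess()` on the same `NodeHealth` instance, so a heartbeat that gets through clears the block for ordinary requests as well.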

--
This message was sent by Atlassian Jira
(v8.3.4#803005)