Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2021/02/03 20:51:01 UTC

[jira] [Created] (NIFI-8196) When a node is disconnected due to failing to service a request, upon cluster reconnection it may not participate in leader election

Mark Payne created NIFI-8196:
--------------------------------

             Summary: When a node is disconnected due to failing to service a request, upon cluster reconnection it may not participate in leader election
                 Key: NIFI-8196
                 URL: https://issues.apache.org/jira/browse/NIFI-8196
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne


NIFI-7920 fixed a bug that could result in nodes getting the wrong Revision for some components. The fix, however, appears to have caused a regression. When a node is disconnected because it failed to service a replicated API request (for example, a request to stop, start, or move a component), it now unregisters from leader election for the Primary Node and Cluster Coordinator roles. However, when it then reconnects, it does not re-register for those roles. As a result, a node that disconnects and reconnects can never become Cluster Coordinator. If this happens to every node in a cluster, no node remains eligible to become Cluster Coordinator. This results in logs such as:
{code:java}
2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: java.lang.IllegalArgumentException: Cannot send heartbeat to address []. Address must be in <hostname>:<port> format {code}
And errors in the UI stating:
{code:java}
Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected.. Returning Service Unavailable response. {code}
At this point, no Cluster Coordinator will be elected until the nodes are restarted.
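
The disconnect/reconnect sequence described above can be illustrated with a small, self-contained model. This is only a sketch of the failure mode, not NiFi's actual code: the class and method names (LeaderElectionSketch, ElectionRegistry, Node, coordinatorPossibleAfterCycle) and the reRegisterOnReconnect flag are all hypothetical and exist only for illustration. A single candidate set stands in for both the Primary Node and Cluster Coordinator roles.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the regression; none of these types exist in NiFi.
public class LeaderElectionSketch {

    /** Tracks which nodes are currently candidates for the elected roles. */
    static class ElectionRegistry {
        private final Set<String> candidates = new LinkedHashSet<>();

        void register(String nodeId) {
            candidates.add(nodeId);
        }

        void unregister(String nodeId) {
            candidates.remove(nodeId);
        }

        boolean hasCandidates() {
            return !candidates.isEmpty();
        }
    }

    /** A cluster node that registers for election when it is created. */
    static class Node {
        private final String id;
        private final ElectionRegistry registry;
        private final boolean reRegisterOnReconnect; // assumed switch: false models the regression

        Node(String id, ElectionRegistry registry, boolean reRegisterOnReconnect) {
            this.id = id;
            this.registry = registry;
            this.reRegisterOnReconnect = reRegisterOnReconnect;
            registry.register(id);
        }

        /** Disconnection after failing to service a replicated request. */
        void disconnectAfterFailedRequest() {
            registry.unregister(id);
        }

        /** Reconnection; in the regression the node never re-registers. */
        void reconnect() {
            if (reRegisterOnReconnect) {
                registry.register(id);
            }
        }
    }

    /** Cycles every node through disconnect/reconnect and reports whether any candidate remains. */
    public static boolean coordinatorPossibleAfterCycle(boolean reRegisterOnReconnect) {
        ElectionRegistry registry = new ElectionRegistry();
        Node[] nodes = {
            new Node("node-1", registry, reRegisterOnReconnect),
            new Node("node-2", registry, reRegisterOnReconnect)
        };
        for (Node node : nodes) {
            node.disconnectAfterFailedRequest();
            node.reconnect();
        }
        return registry.hasCandidates();
    }

    public static void main(String[] args) {
        System.out.println("without re-registration: " + coordinatorPossibleAfterCycle(false));
        System.out.println("with re-registration: " + coordinatorPossibleAfterCycle(true));
    }
}
```

In this model, once every node has gone through one disconnect/reconnect cycle without re-registering, the candidate set is empty and nothing short of recreating the nodes (the analogue of a restart) repopulates it.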



--
This message was sent by Atlassian Jira
(v8.3.4#803005)