Posted to issues@nifi.apache.org by "Joe Witt (Jira)" <ji...@apache.org> on 2021/02/03 20:55:00 UTC

[jira] [Updated] (NIFI-8196) When a node is disconnected due to failing to service a request, upon cluster reconnection it may not participate in leader election

     [ https://issues.apache.org/jira/browse/NIFI-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Witt updated NIFI-8196:
---------------------------
    Fix Version/s: 1.13.0

> When a node is disconnected due to failing to service a request, upon cluster reconnection it may not participate in leader election
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-8196
>                 URL: https://issues.apache.org/jira/browse/NIFI-8196
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Blocker
>             Fix For: 1.13.0
>
>
> NIFI-7920 fixed a bug that could result in nodes getting the wrong Revision for some components. That fix, however, appears to have caused a regression. When a node is disconnected because it failed to service a replicated API request (for example, a component being stopped, started, or moved), it now unregisters from leader election for the Primary Node and Cluster Coordinator roles. If it then reconnects, however, it does not re-register for those roles. As a result, a node that disconnects and reconnects may never be able to become Cluster Coordinator. If this happens to every node in a cluster, no node remains eligible to become Cluster Coordinator. This results in logs such as:
> {code:java}
> 2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: java.lang.IllegalArgumentException: Cannot send heartbeat to address []. Address must be in <hostname>:<port> format {code}
> And errors in the UI stating:
> {code:java}
> Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected.. Returning Service Unavailable response. {code}
> At this point, no Cluster Coordinator will be elected until the nodes are restarted.
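
The disconnect/reconnect behavior described in the issue can be illustrated with a minimal sketch. Every class, method, and flag name below (ElectionRegistry, ClusterNodeSketch, reRegisterOnReconnect) is hypothetical and is not taken from the NiFi codebase; the issue does not say how the fix was implemented, so re-registering on reconnect is shown only as one plausible remedy.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a leader election manager; not NiFi's actual API.
class ElectionRegistry {
    private final Set<String> registeredRoles = new HashSet<>();

    void register(String role) { registeredRoles.add(role); }

    void unregister(String role) { registeredRoles.remove(role); }

    boolean isCandidateFor(String role) { return registeredRoles.contains(role); }
}

// Hypothetical model of a cluster node's election lifecycle.
class ClusterNodeSketch {
    static final String CLUSTER_COORDINATOR = "Cluster Coordinator";
    static final String PRIMARY_NODE = "Primary Node";

    private final ElectionRegistry elections = new ElectionRegistry();
    // Assumption: a flag that toggles between the reported behavior (false)
    // and one plausible remedy (true).
    private final boolean reRegisterOnReconnect;

    ClusterNodeSketch(boolean reRegisterOnReconnect) {
        this.reRegisterOnReconnect = reRegisterOnReconnect;
        joinElections(); // a node registers for both roles at startup
    }

    private void joinElections() {
        elections.register(CLUSTER_COORDINATOR);
        elections.register(PRIMARY_NODE);
    }

    // Disconnection after failing to service a replicated request:
    // the node withdraws from both elections.
    void disconnect() {
        elections.unregister(CLUSTER_COORDINATOR);
        elections.unregister(PRIMARY_NODE);
    }

    // Reconnection: without re-registration the node stays ineligible.
    void reconnect() {
        if (reRegisterOnReconnect) {
            joinElections();
        }
    }

    boolean eligibleForCoordinator() {
        return elections.isCandidateFor(CLUSTER_COORDINATOR);
    }
}
```

In this sketch, a node built with `new ClusterNodeSketch(false)` is no longer eligible for Cluster Coordinator after `disconnect()` followed by `reconnect()`, which mirrors the reported regression. A node built with `new ClusterNodeSketch(true)` becomes eligible again after reconnecting.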


