Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/02/05 20:21:01 UTC

[jira] [Commented] (NIFI-8204) When Cluster Coordinator dies suddenly, is possible for Component Revisions to be inconsistent across nodes in cluster

    [ https://issues.apache.org/jira/browse/NIFI-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279973#comment-17279973 ] 

ASF subversion and git services commented on NIFI-8204:
-------------------------------------------------------

Commit 749d05840ba88efc8b42f5434d9223104edfab68 in nifi's branch refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=749d058 ]

NIFI-8204, NIFI-7866: Send the revision update count in heartbeats. If the update count in a heartbeat is greater than that of the cluster coordinator, request that the node reconnect to get the most up-to-date revisions. We cannot check for exact equality, as the values may change between the time a heartbeat is created and the time the cluster coordinator receives it. However, it should be safe to assume that the node's update count won't be greater than that of the cluster coordinator. There is a tiny window in which it could be: the sending node may update its revision, create the heartbeat, and send it, and the cluster coordinator may process it before updating its own revision. However, this window is very small, and the only consequence would be the sending node reconnecting, which resolves itself. Also, while testing this fix, encountered NIFI-7866 and addressed that NullPointerException.
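
The comparison rule described in this commit message can be sketched as follows. This is a minimal illustration in Python, not the NiFi (Java) code from the commit; the `Heartbeat` class, its field names, and `should_request_reconnect` are hypothetical names chosen for the example, and the counts are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat payload; only the field relevant here is modeled."""
    node_id: str
    revision_update_count: int


def should_request_reconnect(heartbeat: Heartbeat, coordinator_update_count: int) -> bool:
    # Strict "greater than" rather than "not equal": a lower or equal count is
    # tolerated because the counts can legitimately differ while a heartbeat
    # is in flight.
    return heartbeat.revision_update_count > coordinator_update_count


coordinator_count = 10  # illustrative value for the coordinator's own count

print(should_request_reconnect(Heartbeat("node-1", 12), coordinator_count))  # ahead of coordinator
print(should_request_reconnect(Heartbeat("node-1", 10), coordinator_count))  # in sync
print(should_request_reconnect(Heartbeat("node-1", 9), coordinator_count))   # slightly behind
```

Only the first heartbeat would trigger a reconnect request in this sketch. A node that is briefly behind is left alone, which matches the reasoning that exact equality cannot be enforced.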

This closes #4806.

Signed-off-by: Bryan Bende <bb...@apache.org>


> When Cluster Coordinator dies suddenly, is possible for Component Revisions to be inconsistent across nodes in cluster
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-8204
>                 URL: https://issues.apache.org/jira/browse/NIFI-8204
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> I encountered a scenario in a 2-node cluster where Node 0 was the Cluster Coordinator. It suddenly died and was restarted by the RunNiFi process. The restart occurred more quickly than the ZooKeeper session timeout. Once the node had rejoined the cluster, I started to see errors when attempting to modify a component, stating that "Node xyz is unable to fulfill this request due to  [0, null, <uuid>] is not the most up-to-date revision. This component appears to have been modified."
> Refreshing the browser did not help. This indicates that nodes in the cluster have different component revisions.
> After looking through logs, here is the series of events that led to this situation:
>  
> Node 0 restarts but is still Cluster Coordinator. Its topology shows all nodes disconnected and all revisions empty.
> Node 1 heartbeats to Node 0. Node 0 responds saying: Your cluster topology is wrong. node-1 should be DISCONNECTED due to Has Not Yet Connected.
> Node 1 updates its topology as directed.
> Node 1 becomes Cluster Coordinator because Node 0 hasn't yet connected and Node 0's ZooKeeper session times out.
> Node 1 receives a heartbeat from itself.
> Node 1 determines that it hasn't yet connected (based on the topology received from Node 0), so it issues a reconnection request.
> Node 1 changes the state of Node 1 from DISCONNECTED to CONNECTING and notifies Node 0 of the topology update.
> Node 1 relinquishes its role as Cluster Coordinator.
> Node 1 requests (to itself) to join the cluster.
> Node 1 receives a ConnectionResponse (from itself) that includes a collection of 79 revisions.
> Node 0 finishes startup. It has a set of empty revisions.
> Node 0 becomes Cluster Coordinator.
> Node 1 sends a heartbeat to Node 0.
> Node 0 marks Node 1 as Connected to Cluster.
>  
> We should address this by keeping track of the number of updates to the Revision Manager and sending this in Heartbeat messages. When the Cluster Coordinator receives a heartbeat, it should compare the update count to its own internal update count. If the heartbeat's update count is higher, it should request that the sending node reconnect to the cluster. This will ensure that if this situation were to arise again, the node would reconnect and get the most up-to-date set of revisions.
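
To make the proposal above concrete, here is a small Python sketch of a revision store that counts its updates, followed by a replay of the end state from the event list. It is an illustration only: the `RevisionManager` class below is a toy stand-in rather than NiFi's actual class, and `reset_to` encodes the assumption that a reconnecting node adopts the coordinator's revisions and update count.

```python
class RevisionManager:
    """Toy stand-in for a per-node revision store (not NiFi's actual class)."""

    def __init__(self):
        self.revisions = {}     # component id -> revision version
        self.update_count = 0   # total number of revision updates applied

    def update(self, component_id):
        # Bump the component's version and the overall update counter together.
        self.revisions[component_id] = self.revisions.get(component_id, 0) + 1
        self.update_count += 1

    def reset_to(self, other):
        # Assumed reconnect behavior: adopt the coordinator's revisions and count.
        self.revisions = dict(other.revisions)
        self.update_count = other.update_count


coordinator = RevisionManager()   # restarted Node 0: empty revisions, count 0
node1 = RevisionManager()
for _ in range(3):
    node1.update("processor-a")   # Node 1 still holds revisions from before the restart

# The heartbeat carries node1.update_count; the coordinator compares it to its own.
if node1.update_count > coordinator.update_count:
    node1.reset_to(coordinator)   # Node 1 honors the reconnect request

print(node1.revisions == coordinator.revisions)
```

After the simulated reconnect, both nodes hold the same revisions, so a request carrying a given revision would be judged the same way on every node.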



--
This message was sent by Atlassian Jira
(v8.3.4#803005)