Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/08/17 13:40:00 UTC

[jira] [Updated] (NIFI-10362) Cluster can disconnect node as soon as it rejoins cluster upon restart

     [ https://issues.apache.org/jira/browse/NIFI-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Payne updated NIFI-10362:
------------------------------
    Fix Version/s: 1.18.0
           Status: Patch Available  (was: Open)

> Cluster can disconnect node as soon as it rejoins cluster upon restart
> ----------------------------------------------------------------------
>
>                 Key: NIFI-10362
>                 URL: https://issues.apache.org/jira/browse/NIFI-10362
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.18.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the Cluster Coordinator disconnects a node because a user requested it, the node is immediately marked as DISCONNECTED, and a background thread is then responsible for notifying the node that it has been disconnected. If the background task cannot send the notification, it retries several times.
> However, if the node is disconnected and then restarted before it has been notified, the node becomes CONNECTING (and possibly then CONNECTED), and only then is the background task triggered. As a result, the node is told that it is DISCONNECTED, but the Cluster Coordinator doesn't think so (because it has already changed the state back to CONNECTING/CONNECTED).
> While the chances of this happening in production are slim and it's easily worked around (by simply waiting a few seconds after disconnecting a node before restarting it, or by restarting without disconnecting), it causes a lot of problems for system tests and potentially for other automated activities.
> It results in the following log message in the Cluster Coordinator:
> {code:java}
> 2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672] org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to notify localhost:5672 that it has been disconnected from the cluster due to User anonymous requested that node be disconnected from cluster
> {code}
> And then we see confusing error messages such as:
> {code:java}
> 2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23] org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator Received a status of 200 from localhost:5672 for request PUT /nifi-api/flow/process-groups/root when performing first stage of two-stage commit. The action will not occur. Node explanation: {"id":"root","state":"STOPPED"}
> {code}
> This happens because, when the Cluster Coordinator replicates the request to all nodes, the node that thinks it is disconnected receives the request and performs the action. It then responds with a "200 OK", but it should have noted that this is the first phase of a two-phase action and responded with "201 Continue".
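
The race described above can be illustrated with a small sketch. This is not the patch attached to NIFI-10362 and does not use NiFi's actual classes. It is a minimal illustration, assuming the coordinator keeps a per-node update counter and that the background task checks it before sending a disconnect notification. All names here (DisconnectRaceSketch, NodeStatus, updateId, shouldNotifyDisconnect) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: none of these types exist in NiFi.
class DisconnectRaceSketch {

    enum State { CONNECTING, CONNECTED, DISCONNECTED }

    // Coordinator-side view of one node: its state plus a counter
    // that increases on every state change (an assumption of this sketch).
    static final class NodeStatus {
        final State state;
        final long updateId;

        NodeStatus(State state, long updateId) {
            this.state = state;
            this.updateId = updateId;
        }
    }

    private final Map<String, NodeStatus> statuses = new ConcurrentHashMap<>();

    // Records a new state for the node and returns the resulting status.
    // The first update gets id 1; each later update increments the id.
    NodeStatus setState(String nodeId, State state) {
        return statuses.merge(
                nodeId,
                new NodeStatus(state, 1L),
                (old, ignored) -> new NodeStatus(state, old.updateId + 1));
    }

    // Called by the background task before each notification attempt.
    // The notification is sent only if the node is still DISCONNECTED and
    // its status has not changed since the disconnect was requested.
    boolean shouldNotifyDisconnect(String nodeId, NodeStatus requested) {
        NodeStatus current = statuses.get(nodeId);
        return current != null
                && current.state == State.DISCONNECTED
                && current.updateId == requested.updateId;
    }
}
```

With a check like this, a node that has been restarted and has moved back to CONNECTING by the time the background task runs would no longer be told that it is DISCONNECTED, because the status captured at disconnect time is stale. Comparing the counter as well as the state also covers the case where the node is disconnected a second time before an older notification attempt runs.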



--
This message was sent by Atlassian Jira
(v8.20.10#820010)