Posted to issues@nifi.apache.org by "Joe Witt (Jira)" <ji...@apache.org> on 2022/12/08 20:27:00 UTC

[jira] [Commented] (NIFI-9559) Zookeeper Client Can't Reconnect - Session timeout has elapsed while SUSPENDED

    [ https://issues.apache.org/jira/browse/NIFI-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644967#comment-17644967 ] 

Joe Witt commented on NIFI-9559:
--------------------------------

Any news on this issue? https://issues.apache.org/jira/browse/NIFI-9559
We also have a 3-node NiFi cluster (1.19.0) on AWS ECS Fargate with an external ZooKeeper (3.8.0, 3 ZK nodes) and have the same issue.
Under high load the leading node loses its connection to the cluster and does not reconnect correctly.



was the load high on the ZK node or on the NiFi node?


Only on the NiFi node. I start the process on NiFi and after 20-30 minutes the connection breaks down.
But memory and CPU on the node are okay.
The task on ECS is still running and I can access the node via its direct IP. In Cluster Management it is displayed as disconnected.


and once it breaks down it is unable to restore?


right, the task is running. And when I restart the task it breaks down again and cannot recover.


yeesh that is brutal.  How do you get it back in?


not really back into a working state. If I shut down the node and delete flow.xml.gz, I can start the node and it reconnects to the cluster.
But the whole canvas is lost. So at the moment I have no procedure for recovery.


Why would you have to delete the flow.xml.gz and/or why wouldn't that node rejoin the cluster and inherit the flow...  ?


so after the connection loss to the cluster, this happens to all nodes in turn:
Node 1 is leader => processes load => after a few minutes it loses connection to the cluster
Node 2 becomes leader => after a few minutes it loses connection


so restarting the node with an untouched flow.xml.gz leads to a task that does not start


I could see an error message in Flow Initialization

can you share/attach those logs?


and put this description in the Jira

yes, I will do. But I have to do it tomorrow :smile:


but thanks for your time and questions

Hi Joe, I want to update you. The problem is solved. The main issue was throttling caused by the throughput mode on AWS EFS. We use EFS as storage for the NiFi data that has to persist (state, content_repository, flow.file, database_repository and so on). It was wrongly configured as bursting, and the limit was reached very quickly during processing. Because of the throttling, the node lost its connection to the cluster. Then there was a ping-pong effect, because every node uses the same EFS filesystem (but a different folder).
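
For anyone who wants to check for the same misconfiguration, here is a minimal sketch (not part of the original exchange) that flags EFS file systems running in bursting throughput mode. It assumes boto3 is installed and AWS credentials are configured. The helper names `bursting_file_systems` and `report_bursting` are hypothetical, and pagination of `describe_file_systems` is ignored for brevity.

```python
from typing import Any, Dict, List


def bursting_file_systems(response: Dict[str, Any]) -> List[str]:
    """Return the IDs of file systems whose ThroughputMode is 'bursting'.

    `response` is assumed to have the shape returned by the boto3 EFS
    client's describe_file_systems() call.
    """
    return [
        fs["FileSystemId"]
        for fs in response.get("FileSystems", [])
        if fs.get("ThroughputMode") == "bursting"
    ]


def report_bursting() -> None:
    """Hypothetical helper: print every bursting-mode file system in the account."""
    # Imported here so the pure helper above works without boto3 or AWS access.
    import boto3

    efs = boto3.client("efs")
    # Only the first page of results is inspected in this sketch.
    for fs_id in bursting_file_systems(efs.describe_file_systems()):
        print(f"{fs_id} uses bursting throughput; check its BurstCreditBalance metric")
```

Calling `report_bursting()` lists the candidates. Whether bursting mode is actually the bottleneck still has to be confirmed against the file system's CloudWatch metrics, as it was in the case described above.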

> Zookeeper Client Can't Reconnect - Session timeout has elapsed while SUSPENDED
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-9559
>                 URL: https://issues.apache.org/jira/browse/NIFI-9559
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Shawn Weeks
>            Assignee: Matt Burgess
>            Priority: Minor
>         Attachments: nifi_and_zookeeper_logs.txt, nifi_error.log
>
>
> After a loss of connection to Zookeeper a NiFi node never successfully reconnects to the Zookeeper or the Cluster and instead returns errors about no Cluster Coordinator and a Session timeout has elapsed while SUSPENDED repeatedly until you restart NiFi.
> The error described is the same one at https://issues.apache.org/jira/browse/CURATOR-405 however that patch has been in NiFi for several versions now.
> NiFi version is 1.15.3 and Zookeeper 3.6.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)