You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nifi.apache.org by Shawn Weeks <sw...@weeksconsulting.us> on 2022/01/10 18:43:57 UTC

RE: NiFi Losing Zookeeper Connection

I think I beat you to it. Feel free to mark whichever a dupelicate.

https://issues.apache.org/jira/browse/NIFI-9559

Thanks
Shawn

From: Nathan Gough <th...@apache.org>
Sent: Monday, January 10, 2022 12:37 PM
To: users@nifi.apache.org
Subject: Re: NiFi Losing Zookeeper Connection

Hi Shawn,

Thanks for reporting the issue. I'll see if I can reproduce it and figure out why it's occurring/what we can do to fix it. I created a Jira issue here: https://issues.apache.org/jira/browse/NIFI-9560.

Nathan

On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks <sw...@weeksconsulting.us>> wrote:
In my case the IP of Zookeeper is not changing though…

Thanks
Shawn

From: Shawn Weeks <sw...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:23 PM
To: users@nifi.apache.org<ma...@nifi.apache.org>
Subject: RE: NiFi Losing Zookeeper Connection

Of note someone on Stack overflow is having this issue with the current version of the curator framework. This sounds like the same issue.

https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online

Thanks
Shawn

From: Shawn Weeks <sw...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:12 PM
To: users@nifi.apache.org<ma...@nifi.apache.org>
Subject: NiFi Losing Zookeeper Connection

I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a while now and I was wondering if anyone had any ideas. Basic issue is a NiFi node will lose its connection to Zookeeper due to network interruptions and then it’s never able to get its connection back. Logs look like it’s retrying over and over but I suspect it’s not and it’s stuck in this mode where the connection is gone but it’s never going to reconnect. Only way to resolve the issue is to restart NiFi. Exception in the logs starts around 2022-01-10 17:20:55,919 and I’ve cross referenced it with some zookeeper logs at the same time. All three zookeeper logs show a similar error about this box. In this example 192.168.1.212 is the IP for the NiFi instance called nifi0592.example.org<http://nifi0592.example.org>. This is running in AWS and I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and VPCs. I’ve found mentions of the suspended in a zookeeper ticket but the client version that fixed it has been in NiFi for several versions now.

Thanks
Shawn

# NiFi Log
2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@718198db<ma...@718198db> checkpointed with 2951 Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814
2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service - Invoke EBM Workflow] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 2
2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor Inbound SFTP] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 3
2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New EBM Parent] is not the most recent version of the flow that is under Version Control; current version is 5; most recent version is 6
2022-01-10 17:20:24,142 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:56,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:57,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693] o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request: org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet.. Returning Service Unavailable response.
2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Could not communicate with nifi0592.example.org:8443<http://nifi0592.example.org:8443> to determine which node(s) exist in the remote NiFi instance, due to org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: response code 503:Service Unavailable with explanation: null
2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status cache; remote group consists of 2 peers
2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:17,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository

# Zookeeper Log
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51384<http://192.168.1.212:51384>, session = 0x3002a3c766c02b7
     at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51380<http://192.168.1.212:51380>, session = 0x3002a3c766c02b6
     at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b7
2022-01-10 17:20:55,045 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b6
2022-01-10 17:20:55,910 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b8
2022-01-10 17:20:55,910 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b9

RE: NiFi Losing Zookeeper Connection

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
Would that have caused a NiFi node to never be able to reconnect to Zookeeper? I even restarted the external Zookeeper cluster and whichever NiFi node was refusing to connect still wouldn’t connect until I restarted NiFi.

Thanks
Shawn

From: Edward Armes <ed...@gmail.com>
Sent: Monday, January 10, 2022 6:12 PM
To: users@nifi.apache.org
Subject: Re: NiFi Losing Zookeeper Connection


I've just had a quick look on ZooKeeper JIRA and it looks like there is a bug around ZooKeeper session closing that was introduced as part of a change in 3.6 release it looks like it might of been fixed in 3.7 release

https://issues.apache.org/jira/browse/ZOOKEEPER-3822
https://issues.apache.org/jira/browse/ZOOKEEPER-3706
https://issues.apache.org/jira/browse/ZOOKEEPER-3828

Edward

On Mon, Jan 10, 2022 at 6:44 PM Shawn Weeks <sw...@weeksconsulting.us>> wrote:
I think I beat you to it. Feel free to mark whichever a dupelicate.

https://issues.apache.org/jira/browse/NIFI-9559

Thanks
Shawn

From: Nathan Gough <th...@apache.org>>
Sent: Monday, January 10, 2022 12:37 PM
To: users@nifi.apache.org<ma...@nifi.apache.org>
Subject: Re: NiFi Losing Zookeeper Connection

Hi Shawn,

Thanks for reporting the issue. I'll see if I can reproduce it and figure out why it's occurring/what we can do to fix it. I created a Jira issue here: https://issues.apache.org/jira/browse/NIFI-9560.

Nathan

On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks <sw...@weeksconsulting.us>> wrote:
In my case the IP of Zookeeper is not changing though…

Thanks
Shawn

From: Shawn Weeks <sw...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:23 PM
To: users@nifi.apache.org<ma...@nifi.apache.org>
Subject: RE: NiFi Losing Zookeeper Connection

Of note someone on Stack overflow is having this issue with the current version of the curator framework. This sounds like the same issue.

https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online

Thanks
Shawn

From: Shawn Weeks <sw...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:12 PM
To: users@nifi.apache.org<ma...@nifi.apache.org>
Subject: NiFi Losing Zookeeper Connection

I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a while now and I was wondering if anyone had any ideas. Basic issue is a NiFi node will lose its connection to Zookeeper due to network interruptions and then it’s never able to get its connection back. Logs look like it’s retrying over and over but I suspect it’s not and it’s stuck in this mode where the connection is gone but it’s never going to reconnect. Only way to resolve the issue is to restart NiFi. Exception in the logs starts around 2022-01-10 17:20:55,919 and I’ve cross referenced it with some zookeeper logs at the same time. All three zookeeper logs show a similar error about this box. In this example 192.168.1.212 is the IP for the NiFi instance called nifi0592.example.org<http://nifi0592.example.org>. This is running in AWS and I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and VPCs. I’ve found mentions of the suspended in a zookeeper ticket but the client version that fixed it has been in NiFi for several versions now.

Thanks
Shawn

# NiFi Log
2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@718198db<ma...@718198db> checkpointed with 2951 Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814
2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service - Invoke EBM Workflow] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 2
2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor Inbound SFTP] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 3
2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New EBM Parent] is not the most recent version of the flow that is under Version Control; current version is 5; most recent version is 6
2022-01-10 17:20:24,142 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:56,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:57,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693] o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request: org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet.. Returning Service Unavailable response.
2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Could not communicate with nifi0592.example.org:8443<http://nifi0592.example.org:8443> to determine which node(s) exist in the remote NiFi instance, due to org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: response code 503:Service Unavailable with explanation: null
2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status cache; remote group consists of 2 peers
2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:17,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository

# Zookeeper Log
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51384<http://192.168.1.212:51384>, session = 0x3002a3c766c02b7
     at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51380<http://192.168.1.212:51380>, session = 0x3002a3c766c02b6
     at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b7
2022-01-10 17:20:55,045 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b6
2022-01-10 17:20:55,910 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b8
2022-01-10 17:20:55,910 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b9

Re: NiFi Losing Zookeeper Connection

Posted by Edward Armes <ed...@gmail.com>.
I've just had a quick look on ZooKeeper JIRA and it looks like there is a
bug around ZooKeeper session closing that was introduced as part of a
change in 3.6 release it looks like it might of been fixed in 3.7 release

https://issues.apache.org/jira/browse/ZOOKEEPER-3822
https://issues.apache.org/jira/browse/ZOOKEEPER-3706
https://issues.apache.org/jira/browse/ZOOKEEPER-3828

Edward

On Mon, Jan 10, 2022 at 6:44 PM Shawn Weeks <sw...@weeksconsulting.us>
wrote:

> I think I beat you to it. Feel free to mark whichever a dupelicate.
>
>
>
> https://issues.apache.org/jira/browse/NIFI-9559
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Nathan Gough <th...@apache.org>
> *Sent:* Monday, January 10, 2022 12:37 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: NiFi Losing Zookeeper Connection
>
>
>
> Hi Shawn,
>
>
>
> Thanks for reporting the issue. I'll see if I can reproduce it and figure
> out why it's occurring/what we can do to fix it. I created a Jira issue
> here: https://issues.apache.org/jira/browse/NIFI-9560.
>
>
>
> Nathan
>
>
>
> On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks <sw...@weeksconsulting.us>
> wrote:
>
> In my case the IP of Zookeeper is not changing though…
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Shawn Weeks <sw...@weeksconsulting.us>
> *Sent:* Monday, January 10, 2022 12:23 PM
> *To:* users@nifi.apache.org
> *Subject:* RE: NiFi Losing Zookeeper Connection
>
>
>
> Of note someone on Stack overflow is having this issue with the current
> version of the curator framework. This sounds like the same issue.
>
>
>
>
> https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Shawn Weeks <sw...@weeksconsulting.us>
> *Sent:* Monday, January 10, 2022 12:12 PM
> *To:* users@nifi.apache.org
> *Subject:* NiFi Losing Zookeeper Connection
>
>
>
> I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a
> while now and I was wondering if anyone had any ideas. Basic issue is a
> NiFi node will lose its connection to Zookeeper due to network
> interruptions and then it’s never able to get its connection back. Logs
> look like it’s retrying over and over but I suspect it’s not and it’s stuck
> in this mode where the connection is gone but it’s never going to
> reconnect. Only way to resolve the issue is to restart NiFi. Exception in
> the logs starts around 2022-01-10 17:20:55,919 and I’ve cross referenced it
> with some zookeeper logs at the same time. All three zookeeper logs show a
> similar error about this box. In this example 192.168.1.212 is the IP for
> the NiFi instance called nifi0592.example.org. This is running in AWS and
> I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on
> Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and
> VPCs. I’ve found mentions of the suspended in a zookeeper ticket but the
> client version that fixed it has been in NiFi for several versions now.
>
>
>
> Thanks
>
> Shawn
>
>
>
> # NiFi Log
>
> 2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider
> Maintenance] org.wali.MinimalLockingWriteAheadLog
> org.wali.MinimalLockingWriteAheadLog@718198db checkpointed with 2951
> Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11
> milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814
>
> 2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:16,992 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:16,992 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service
> - Invoke EBM Workflow] is not the most recent version of the flow that is
> under Version Control; current version is 1; most recent version is 2
>
> 2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor
> Inbound SFTP] is not the most recent version of the flow that is under
> Version Control; current version is 1; most recent version is 3
>
> 2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New
> EBM Parent] is not the most recent version of the flow that is under
> Version Control; current version is 5; most recent version is 6
>
> 2022-01-10 17:20:24,142 INFO [Cleanup Archive for default]
> o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0
> bytes) from archive
>
> 2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:36,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:36,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
>
> 2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
>
> 2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:56,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:57,032 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693]
> o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request:
> org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node
> has yet been elected Cluster Coordinator. Cannot establish connection to
> cluster yet.. Returning Service Unavailable response.
>
> 2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector]
> o.apache.nifi.remote.client.PeerSelector Could not communicate with
> nifi0592.example.org:8443 to determine which node(s) exist in the remote
> NiFi instance, due to
> org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException:
> response code 503:Service Unavailable with explanation: null
>
> 2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector]
> o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status
> cache; remote group consists of 2 peers
>
> 2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:21:17,032 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
>
>
> # Zookeeper Log
>
> 2022-01-10 17:20:54,455 [myid:3] - WARN
> [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
>
> EndOfStreamException: Unable to read additional data from client, it
> probably closed the socket: address = /192.168.1.212:51384, session =
> 0x3002a3c766c02b7
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>
>      at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
>      at java.base/java.lang.Thread.run(Thread.java:829)
>
> 2022-01-10 17:20:54,455 [myid:3] - WARN
> [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
>
> EndOfStreamException: Unable to read additional data from client, it
> probably closed the socket: address = /192.168.1.212:51380, session =
> 0x3002a3c766c02b6
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>
>      at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
>      at java.base/java.lang.Thread.run(Thread.java:829)
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7,
> timeout of 10000ms exceeded
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6,
> timeout of 10000ms exceeded
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
> closeSession request for session 0x3002a3c766c02b7
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
> closeSession request for session 0x3002a3c766c02b6
>
> 2022-01-10 17:20:55,910 [myid:3] - INFO
> [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session
> 0x3002a3c766c02b8
>
> 2022-01-10 17:20:55,910 [myid:3] - INFO
> [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session
> 0x3002a3c766c02b9
>
>