Posted to users@nifi.apache.org by Axel Schwarz <Ax...@emailn.de> on 2021/08/12 07:55:41 UTC

Disfunctional cluster with version 1.13.2

Dear all,

after successfully battling the load balancing and installing version 1.13.2 again in our 3-node production environment, we experienced another failure in the cluster, resulting in a complete cut-off of the flow just 1.5 hours after the update.
We only noticed it when trying to access the web interface, which immediately after login showed something like:

"Cannot replicate request to Node nifiHost1.contoso.com:8443 because the node is not connected"

There was nothing we could do through the web interface aside from staring at this message, and when looking at the live logs, there was nothing suspicious. The log moved on as if nothing had happened.
After a restart of the cluster everything was working fine again, but we saw that the entire flow had not been working for some period of time. This alone is really uncool, as we are running a cluster for exactly that reason: the flow should keep working even if some node decides to malfunction for whatever reason.

Digging a little deeper into the logs showed two noticeable problems:

1. ZooKeeper is restarting every few minutes, which in the log always looks like this:

------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------

2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
java.lang.NullPointerException: null
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
java.lang.NullPointerException: null
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)

------------------------------------------------------------------------------------------------------------------------------------------------------------

This happens, as mentioned, every few minutes and looks like this when condensed through grep:

------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------

./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down

------------------------------------------------------------------------------------------------------------------------------------------------------------
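
For anyone who wants to put numbers on "every few minutes", here is a minimal Python sketch that takes grep output like the above and computes the gaps between consecutive restarts. The function names `restart_times` and `gaps_in_seconds` are made up for illustration and are not part of NiFi or ZooKeeper; the only assumption about the log is the `yyyy-MM-dd HH:mm:ss,SSS` timestamp format visible in the lines above.

```python
import re
from datetime import datetime

# Timestamp as it appears in nifi-app.log, e.g. "2021-08-11 12:02:39,187".
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")

# Same case-sensitive marker that was used for the grep above.
MARKER = "ZooKeeperServer Shutting down"


def restart_times(lines):
    """Return the timestamps of all lines containing the restart marker."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # The log prints milliseconds; datetime stores microseconds.
        times.append(stamp.replace(microsecond=int(match.group(2)) * 1000))
    return times


def gaps_in_seconds(times):
    """Return the seconds between each pair of consecutive timestamps."""
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(times, times[1:])
    ]
```

Calling `gaps_in_seconds(restart_times(open("nifi-app.log")))` on the full log should give the list of gaps, from which the minimum, maximum and mean can be taken.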

So my first question is: Is this behaviour of ZooKeeper normal? Well, it is just an INFO, but I would love to hear that this is a completely normal phenomenon. :D

2. After all these ZooKeeper restarts, NiFi finally throws the following error along with the "last" ZooKeeper restart:

------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------

2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 13:31:13,440 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19500
java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
        at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
        at java.lang.Thread.run(Thread.java:748)
2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19502
java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
        at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
        at java.lang.Thread.run(Thread.java:748)
2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1] o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST, length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary Node disabled
2021-08-11 13:31:13,802 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 13:31:28,198 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 13:31:28,203 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21] o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event Writer for Provenance Event Store Partition[directory=/opt/pr] due to MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1 milliseconds), max Transaction ID 27772177643
2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 10 records in 8 milliseconds
2021-08-11 13:31:28,295 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 13:31:32,291 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED

------------------------------------------------------------------------------------------------------------------------------------------------------------
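
To see how long the ZooKeeper client connection stays suspended each time, the Curator "State change" lines can be paired up. The following Python sketch is only an illustration: the function name `suspensions` is hypothetical, it only looks at the `ConnectionStateManager State change:` lines in the format shown above, and it ignores every state other than SUSPENDED and RECONNECTED.

```python
import re
from datetime import datetime

# Matches lines such as
# "2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED"
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) .*"
    r"ConnectionStateManager State change: (SUSPENDED|RECONNECTED)"
)


def suspensions(lines):
    """Return (suspended_at, reconnected_at, seconds) for each suspension."""
    result = []
    suspended_at = None
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        when = when.replace(microsecond=int(match.group(2)) * 1000)
        state = match.group(3)
        if state == "SUSPENDED" and suspended_at is None:
            # Remember when the connection was first reported as suspended.
            suspended_at = when
        elif state == "RECONNECTED" and suspended_at is not None:
            seconds = (when - suspended_at).total_seconds()
            result.append((suspended_at, when, seconds))
            suspended_at = None
    return result
```

Applied to the excerpt above, this pairs the SUSPENDED at 13:31:13,083 with the RECONNECTED at 13:31:13,802, and the SUSPENDED at 13:31:28,295 with the RECONNECTED at 13:31:32,291.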

After that, the log continues with normal messages about heartbeats and provenance events, but no further errors occur. That was exactly the time at which the whole flow was cut off.

I should add that we never experienced this flow cut-off behaviour with version 1.12.1, although the ZooKeeper restarts occur there as well, and as often as seen above.
But when first updating to version 1.13.2 back in June, we experienced this flow cut-off multiple times, in addition to the load balancing not working. Back then I thought the two were connected, but now we had fully functional load balancing and the cut-off happened nevertheless.

I'm thankful for any hint. :)




Re: Disfunctional cluster with version 1.13.2

Posted by Ryan Hendrickson <ry...@gmail.com>.
Hi Mark,
   We just submitted this one:
https://issues.apache.org/jira/browse/NIFI-9056

   Another one will be on the way today/tomorrow for Relationship Load
Balancing issues.

Ryan

On Tue, Aug 17, 2021 at 10:17 AM Mark Payne <ma...@hotmail.com> wrote:

> Ryan,
>
> Have you filed Jiras for these? If so, can you share the links?
>
> Thanks
> -Mark
>
> On Aug 16, 2021, at 9:32 PM, Ryan Hendrickson <
> ryan.andrew.hendrickson@gmail.com> wrote:
>
> Axel,
>    We've had significant issues with 1.13.2 in a Cluster as well.  We're
> working on a test config... Issues include abandoned FlowFiles, single
> Nodes locking the entire cluster, load balance relationships not working,
> and undocumented NiFi properties.  We're reluctant to move to 1.14.0
> because we haven't seen anything specifically fixed in it.
>
>    Please keep the community up-to-date on your findings.
>
> Ryan
>
> On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <
> pierre.villard.fr@gmail.com> wrote:
>
>> Hi,
>>
>> What's the version of ZK?
>>
>> Thanks,
>> Pierre
>>
>> On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
>>
>>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> So my first question is: Is this behaviour of ZooKeeper normal? Well, it
>>> is just an INFO, but I would love to hear that this is a completely normal
>>> phenomenon. :D
>>>
>>> 2. After all these ZooKeeper restarts, NiFi finally throws the following
>>> error along with the "last" ZooKeeper restart:
>>>
>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
>>>
>>> 2021-08-11 13:31:13,083 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
>>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>>> /opt/nifi/logs/zookeeper/version-2 snapdir
>>> /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
>>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
>>> nifiHost2.contoso.com/192.168.1.11:19500
>>> java.io.EOFException: Expected to receive a Transaction Completion
>>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>>>         at java.lang.Thread.run(Thread.java:748)
>>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
>>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
>>> nifiHost2.contoso.com/192.168.1.11:19502
>>> java.io.EOFException: Expected to receive a Transaction Completion
>>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>>>         at java.lang.Thread.run(Thread.java:748)
>>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
>>> o.a.n.c.c.node.NodeClusterCoordinator Status of
>>> nifiHost1.contoso.com:8443 changed from
>>> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTED,
>>> updateId=16] to NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
>>> state=CONNECTING, updateId=16]
>>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
>>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>>> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
>>> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>>> o.a.nifi.controller.StandardFlowService Processing reconnection request
>>> from cluster coordinator.
>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>>> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary
>>> Node disabled
>>> 2021-08-11 13:31:13,802 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
>>> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
>>> Repository
>>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
>>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
>>> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
>>> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
>>> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>>> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
>>> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
>>> milliseconds), max Transaction ID 27772177643
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>>> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
>>> Repository with 10 records in 8 milliseconds
>>> 2021-08-11 13:31:28,295 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>>> /opt/nifi/logs/zookeeper/version-2 snapdir
>>> /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 13:31:32,291 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> After that, the log continues with normal messages about heartbeats and
>>> provenance events, but no further errors occur. That was exactly the time
>>> the whole flow cut off.
>>>
>>> I have to add that we never experienced this flow cut-off behaviour
>>> with version 1.12.1, although the ZooKeeper restarts occur there as
>>> well and as often as seen above.
>>> But when we first updated to version 1.13.2 back in June, we experienced
>>> this flow cut-off multiple times, in addition to the broken load balancing.
>>> Back then I thought the two were connected, but now we had fully
>>> functional load balancing and the cut-off happened nevertheless.
>>>
>>> I'm thankful for any hint. :)
>>>
>>>
>>>
>>>
>

Re: Disfunctional cluster with version 1.13.2

Posted by Mark Payne <ma...@hotmail.com>.
Ryan,

Have you filed Jiras for these? If so, can you share the links?

Thanks
-Mark

On Aug 16, 2021, at 9:32 PM, Ryan Hendrickson <ry...@gmail.com> wrote:

Axel,
   We've had significant issues with 1.13.2 in a cluster as well.  We're working on a test config... Issues include abandoned FlowFiles, single nodes locking the entire cluster, load-balance relationships not working, and undocumented NiFi properties.  We're reluctant to move to 1.14.0 because we haven't seen anything specifically fixed in it.

   Please keep the community up-to-date on your findings.

Ryan

On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <pi...@gmail.com> wrote:
Hi,

What's the version of ZK?

Thanks,
Pierre

On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
Dear all,

After successfully battling the load balancing and installing version 1.13.2 again in our 3-node production environment, we experienced another cluster failure that resulted in a complete cut-off of the flow just 1.5 h after the update.
We noticed it only when trying to access the web interface, which immediately after login showed something like:

"Cannot replicate request to Node nifiHost1.contoso.com:8443 because the node is not connected"

There was nothing we could do through the web interface aside from staring at this message, and when looking at the live logs, there was nothing suspicious. The log moved on as if nothing had happened.
After a restart of the cluster everything was working fine again, but we saw that the entire flow had not been working for some period of time. This alone is really uncool, as we are running a cluster for exactly that reason: the flow should keep working even if some node decides to malfunction for whatever reason.

Digging a little deeper into the logs showed two noticeable problems:

1. ZooKeeper is restarting every few minutes, which in the log always looks like this:

------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------

2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
java.lang.NullPointerException: null
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
java.lang.NullPointerException: null
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)

------------------------------------------------------------------------------------------------------------------------------------------------------------

This happens, as mentioned, every few minutes and looks like this when condensed through grep:

------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------

./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down

------------------------------------------------------------------------------------------------------------------------------------------------------------
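
To put a number on "every few minutes", the condensed grep output above can be turned into the intervals between consecutive restarts. The following is only a minimal Python sketch, assuming the grep output is available as a list of text lines; the names TIMESTAMP, shutdown_times and gaps_in_seconds are made up for this illustration and are not part of NiFi or ZooKeeper.

```python
import re
from datetime import datetime

# Timestamp format at the start of each nifi-app.log entry,
# e.g. "2021-08-11 12:02:39,187" (assumed from the excerpts in this thread).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def shutdown_times(lines):
    """Return the timestamps of all 'ZooKeeperServer Shutting down' entries."""
    times = []
    for line in lines:
        # Case-sensitive on purpose: every restart also logs a lowercase
        # "shutting down" entry, which would otherwise be counted twice.
        if "ZooKeeperServer Shutting down" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # First three lines of the grep output quoted above (thread name shortened).
    sample = [
        "./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer] o.a.zookeeper.server.ZooKeeperServer Shutting down",
    ]
    for gap in gaps_in_seconds(shutdown_times(sample)):
        print(f"{gap:.3f} s since previous restart")
```

Running the same functions over the full grep output of each node would show whether the restarts cluster around particular times or are spread evenly, which may help when comparing the nodes.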

So my first question is: Is this behaviour of ZooKeeper normal? Well, it is just an INFO, but I would love to hear that this is a completely normal phenomenon. :D

2. After all these ZooKeeper restarts, NiFi finally throws the following error along with the "last" ZooKeeper restart:

------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------

2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 13:31:13,440 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19500
java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
        at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
        at java.lang.Thread.run(Thread.java:748)
2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19502
java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
        at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
        at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
        at java.lang.Thread.run(Thread.java:748)
2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1] o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST, length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary Node disabled
2021-08-11 13:31:13,802 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
2021-08-11 13:31:28,198 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
2021-08-11 13:31:28,203 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21] o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event Writer for Provenance Event Store Partition[directory=/opt/pr] due to MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1 milliseconds), max Transaction ID 27772177643
2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 10 records in 8 milliseconds
2021-08-11 13:31:28,295 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
2021-08-11 13:31:32,291 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED

------------------------------------------------------------------------------------------------------------------------------------------------------------

After that, the log continues with normal messages about heartbeats and provenance events, and no further errors occur. That was exactly the time the whole flow cut off.

I should add that we never experienced this flow cut-off with version 1.12.1, although the ZooKeeper restarts occur there as well, and as often as seen above.
But when we first updated to version 1.13.2 back in June, we experienced this flow cut-off multiple times, in addition to the broken load balancing. Back then I thought the two were connected, but this time load balancing was fully functional and the cut-off happened nevertheless.

I'm thankful for any hint. :)
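
A minimal sketch for quantifying the restart pattern described above, rather than reading it off grep output: it extracts the timestamp of every "ZooKeeperServer Shutting down" line and reports the gaps between consecutive restarts. The function names (restart_times, restart_gaps) are hypothetical, and the regular expression assumes the log line layout seen in the excerpts in this thread; it would need adjusting for a different logback pattern.

```python
import re
from datetime import datetime

# Matches the capitalised "Shutting down" line, which appears once per
# restart in the excerpts (the lowercase "shutting down" twin is ignored).
# Assumption: timestamps look like "2021-08-11 13:31:13,437".
RESTART = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO .*ZooKeeperServer Shutting down"
)


def restart_times(lines):
    """Return the timestamps of all ZooKeeper restart lines, in log order."""
    times = []
    for line in lines:
        match = RESTART.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def restart_gaps(times):
    """Return the seconds elapsed between each pair of consecutive restarts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Two restart lines taken from the excerpt above.
sample = [
    "2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
    "o.a.zookeeper.server.ZooKeeperServer Shutting down",
    "2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
    "o.a.zookeeper.server.ZooKeeperServer Shutting down",
]
print(restart_gaps(restart_times(sample)))  # [15.0]
```

To run it against a real log, pass an open file object to restart_times, for example restart_times(open("nifi-app.log")). Comparing the gaps with the configured ZooKeeper session timeouts may help show whether the restarts cluster around the moment a node drops out.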





Re: Disfunctional cluster with version 1.13.2

Posted by Ryan Hendrickson <ry...@gmail.com>.
Axel - this is the Jira Ticket for NiFi for the missing Sys Admin
Properties: https://issues.apache.org/jira/browse/NIFI-9029

On Mon, Aug 16, 2021 at 9:32 PM Ryan Hendrickson <
ryan.andrew.hendrickson@gmail.com> wrote:

> Axel,
>    We've had significant issues with 1.13.2 in a Cluster as well.  We're
> working on a test config... Issues range from abandoned FlowFiles, single
> Nodes locking the entire cluster, load balance relationships not working,
> and undocumented nifi properties.  We're reluctant to move to 1.14.0
> because we haven't seen anything specifically fixed in it.
>
>    Please keep the community up-to-date on your findings.
>
> Ryan
>
> On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <
> pierre.villard.fr@gmail.com> wrote:
>
>> Hi,
>>
>> What's the version of ZK?
>>
>> Thanks,
>> Pierre
>>

Re: Re: Disfunctional cluster with version 1.13.2

Posted by Joe Witt <jo...@gmail.com>.
If restarting helps, then the cause should generally be easy to find.  Is
memory usage spiking?  Do thread dumps reveal anything?

On Tue, Aug 17, 2021 at 6:12 AM Ryan Hendrickson <
ryan.andrew.hendrickson@gmail.com> wrote:

> We've rolled back to 1.11.4 in a couple of scenarios.  We have also set up a
> cron job to restart 1.13.2 nodes once a day.
>
> Ryan
>
> On Tue, Aug 17, 2021 at 1:25 AM Axel Schwarz <Ax...@emailn.de> wrote:
>
>> Hey Ryan,
>>
>> that sounds awfully familiar. What we successfully battled so far is the
>> load balancing problem.
>> You can find the whole plot of this drama in the mailing list archive,
>> title is "No Load Balancing since 1.13.2"
>>
>> Of course I will keep this thread updated, but unfortunately we had to
>> make the decision to roll back completely to 1.12.1 because we just cannot
>> afford investing more time into this right now. But we'll certainly come
>> back to this later. We have to...
>>
>> --- Original Message ---
>> From: Ryan Hendrickson <ry...@gmail.com>
>> Date: 17.08.2021 03:32:25
>> To: users@nifi.apache.org
>> Subject: Re: Disfunctional cluster with version 1.13.2
>>
>> > Axel,
>> >    We've had significant issues with 1.13.2 in a Cluster as well.  We're
>> > working on a test config... Issues range from abandoned FlowFiles, single
>> > Nodes locking the entire cluster, load balance relationships not working,
>> > and undocumented nifi properties.  We're reluctant to move to 1.14.0
>> > because we haven't seen anything specifically fixed in it.
>> >
>> >    Please keep the community up-to-date on your findings.
>> >
>> > Ryan
>> >
>> > On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <pierre.villard.fr@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > What's the version of ZK?
>> > >
>> > > Thanks,
>> > > Pierre
>> > >
>> > > On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
>> > >
>> > >> Dear all,
>> > >>
>> > >> after successfully battling the load balancing and installing Version
>> > >> 1.13.2 again in our 3 node production environment, we experienced another
>> > >> failure in the cluster resulting in a complete cut-off of the flow just
>> > >> 1,5h after the update.
>> > >> We noticed it just by trying to access the webinterface, which
>> > >> immediately after login showed something like:
>> > >>
>> > >> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
>> > >> node is not connected"
>> > >>
>> > >> There was nothing we could do through the webinterface aside from staring
>> > >> at this message and when looking at the live logs, there was nothing
>> > >> suspicious. The log moved on as if nothing happened.
>> > >> After a restart of the cluster everything was working fine again, but we
>> > >> saw, that the entire flow wasn't working for some period of time. This
>> > >> alone is really uncool, as we running a cluster for exactly that reason:
>> > >> The flow should keep working, even if some node decides to malfunction for
>> > >> whatever reason.
>> > >>
>> > >> Digging a little deeper into the logs showed two noticable problems:
>> > >>
>> > >> 1. The Zookeeper is restarting every few minutes. Which in the log always
>> > >> looks like this:
>> > >>
>> > >> ------------------------------------(nifiHost1.contoso.com)
>> > >>
>> nifi-app.log------------------------------------------------------------------------------------
>> >
>> > >>
>> > >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
>> > >> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor
>> > complete
>> > >> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.SyncRequestProcessor Shutting down
>> > >> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
>> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> > >> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> >
>> > >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> >
>> > >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
>> > 2000
>> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
>> > >> /opt/nifi/state/zookeeper/version-2
>> > >> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process
>> > >> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while
>> > calling
>> > >> watcher
>> > >> java.lang.NullPointerException: null
>> > >>         at
>> > >>
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>> >
>> > >>         at
>> > >> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> >
>> > >> 2021-08-11 12:02:39,277 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> > >> 2021-08-11 12:02:40,001 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> >
>> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> > >>
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>> >
>> > >> Connection State changed to SUSPENDED
>> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> >
>> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> > >>
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>> >
>> > >> Connection State changed to SUSPENDED
>> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> >
>> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> > >>
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>> >
>> > >> Connection State changed to RECONNECTED
>> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> >
>> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> > >>
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>> >
>> > >> Connection State changed to RECONNECTED
>> > >> 2021-08-11 12:02:40,014 INFO [main-EventThread]
>> > >> o.a.c.framework.imps.EnsembleTracker New config event received:
>> > {}
>> > >> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
>> > >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
>> >
>> > >> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
>> >
>> > >> 12:02:40,127; send took 951 millis
>> > >> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process
>> > >> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while
>> > calling
>> > >> watcher
>> > >> java.lang.NullPointerException: null
>> > >>         at
>> > >>
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>> >
>> > >>         at
>> > >> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> >
>> > >>
>> > >>
>> > >>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >
>> > >>
>> > >> This happens, as mentioned, every few minutes and looks like this
>> > >> when condensed through grep:
>> > >>
>> > >> ------------------------------------(nifiHost1.contoso.com)
>> nifi-app.log
>> >
>> > >> | grep "ZooKeeperServer Shutting
>> > >> down"---------------------------------------------
>> > >>
>> > >> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
>> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >>
>> > >>
>> > >>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >
>> > >>
>> > >> So my first question is: Is this behaviour of ZooKeeper normal?
>> > >> Well, it is just an INFO, but I would love to hear that this is a
>> > >> completely normal phenomenon. :D
>> > >>
>> > >> 2. After all these ZooKeeper restarts, NiFi finally throws the
>> > >> following error along with the "last" ZooKeeper restart:
>> > >>
>> > >> ------------------------------------(nifiHost1.contoso.com)
>> > >>
>> nifi-app.log------------------------------------------------------------------------------------
>> >
>> > >>
>> > >> 2021-08-11 13:31:13,083 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> > >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
>> > >> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor
>> > complete
>> > >> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.SyncRequestProcessor Shutting down
>> > >> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
>> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> >
>> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> >
>> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
>> > 2000
>> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
>> > >> /opt/nifi/state/zookeeper/version-2
>> > >> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
>> > >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate
>> > with Peer
>> > >> nifiHost2.contoso.com/192.168.1.11:19500
>> > >> java.io.EOFException: Expected to receive a Transaction Completion
>> >
>> > >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>> >
>> > >>         at java.lang.Thread.run(Thread.java:748)
>> > >> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
>> > >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate
>> > with Peer
>> > >> nifiHost2.contoso.com/192.168.1.11:19502
>> > >> java.io.EOFException: Expected to receive a Transaction Completion
>> >
>> > >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>> >
>> > >>         at
>> > >>
>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>> >
>> > >>         at java.lang.Thread.run(Thread.java:748)
>> > >> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
>> >
>> > >> o.a.n.c.c.node.NodeClusterCoordinator Status of
>> > >> nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=
>> >
>> > >> nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to
>> > >> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
>> > >> state=CONNECTING, updateId=16]
>> > >> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
>> >
>> > >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>> >
>> > >> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
>> >
>> > >> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
>> >
>> > >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>> > >> o.a.nifi.controller.StandardFlowService Processing reconnection
>> > request
>> > >> from cluster coordinator.
>> > >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>> > >> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role
>> > Primary
>> > >> Node disabled
>> > >> 2021-08-11 13:31:13,802 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> > >> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> > >> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
>> > >> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
>> > >> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of
>> > FlowFile
>> > >> Repository
>> > >> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor
>> > complete
>> > >> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.z.server.SyncRequestProcessor Shutting down
>> > >> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
>> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> > >> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
>> > >> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over
>> > Event
>> > >> Writer for Provenance Event Store Partition[directory=/opt/pr] due
>> > to
>> > >> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
>> >
>> > >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>> > >> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead
>> > Log with
>> > >> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time
>> > = 1
>> > >> milliseconds), max Transaction ID 27772177643
>> > >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>> > >> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed
>> > FlowFile
>> > >> Repository with 10 records in 8 milliseconds
>> > >> 2021-08-11 13:31:28,295 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> > >> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> >
>> > >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> >
>> > >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
>> )(secure=disabled)]
>> >
>> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
>> > 2000
>> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
>> > >> /opt/nifi/state/zookeeper/version-2
>> > >> 2021-08-11 13:31:32,291 INFO [main-EventThread]
>> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> > >>
>> > >>
>> > >>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >
>> > >>
>> > >> After that, the log continues with normal messages about heartbeats
>> > >> and provenance events, but no further errors occur. That was exactly
>> > >> the time the whole flow was cut off.
>> > >>
>> > >> I have to add that we never experienced this flow cut-off with
>> > >> version 1.12.1, although the ZooKeeper restarts occur there as well
>> > >> and as often as seen above.
>> > >> But when we first updated to version 1.13.2 back in June, we
>> > >> experienced this flow cut-off multiple times, in addition to the
>> > >> broken load balancing. Back then I thought the two were connected,
>> > >> but now we had fully functional load balancing and the cut-off
>> > >> happened nevertheless.
>> >
>> > >>
>> > >> I'm thankful for any hint. :)
>> > >>
>> > >>
>> > >>
>> > >>
>> >
>>
>>
>>
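
The grep summary quoted above lists one "ZooKeeperServer Shutting down" line per restart. To put numbers on "every few minutes", the gaps between those timestamps can be computed. The following is only a minimal Python sketch: the function names and the regular expression are illustrative assumptions, and it expects the original, unwrapped log lines (not the line-wrapped form shown in this archive). The three sample timestamps are taken from the log excerpt above.

```python
import re
from datetime import datetime

# Matches the timestamp of a capital-S "Shutting down" event from ZooKeeperServer.
# The lowercase "shutting down" line that follows each event is deliberately
# not matched, so each restart is counted once.
SHUTDOWN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) INFO .*ZooKeeperServer Shutting down"
)

def shutdown_times(lines):
    """Return the timestamps of all embedded ZooKeeper shutdown events."""
    times = []
    for line in lines:
        match = SHUTDOWN.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            # The log uses a comma before the milliseconds; add them by hand.
            times.append(stamp.replace(microsecond=int(match.group(2)) * 1000))
    return times

def gaps_seconds(times):
    """Return the seconds elapsed between consecutive shutdown events."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Sample lines in the unwrapped form the log file itself would contain.
THREAD = "[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]"
SAMPLE = [
    f"2021-08-11 12:02:39,187 INFO {THREAD} o.a.zookeeper.server.ZooKeeperServer Shutting down",
    f"2021-08-11 12:02:39,187 INFO {THREAD} o.a.zookeeper.server.ZooKeeperServer shutting down",
    f"2021-08-11 12:03:22,084 INFO {THREAD} o.a.zookeeper.server.ZooKeeperServer Shutting down",
    f"2021-08-11 12:03:57,639 INFO {THREAD} o.a.zookeeper.server.ZooKeeperServer Shutting down",
]

if __name__ == "__main__":
    for gap in gaps_seconds(shutdown_times(SAMPLE)):
        print(f"{gap:.3f} s since previous restart")
```

To run it against a real file, pass the open file object instead of SAMPLE, for example `shutdown_times(open("nifi-app.log"))`. The regex also matches grep output with the `./nifi-app.log:4237:` prefix, since it searches anywhere in the line.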

Re: Re: Disfunctional cluster with version 1.13.2

Posted by Ryan Hendrickson <ry...@gmail.com>.
We've rolled back to 1.11.4 in a couple of scenarios.  We have also set up a
cron job to restart 1.13.2 nodes once a day.

Ryan

On Tue, Aug 17, 2021 at 1:25 AM Axel Schwarz <Ax...@emailn.de> wrote:

> Hey Ryan,
>
> that sounds awfully familiar. What we have successfully battled so far is
> the load balancing problem.
> You can find the whole plot of this drama in the mailing list archive;
> the title is "No Load Balancing since 1.13.2".
>
> Of course I will keep this thread updated, but unfortunately we had to
> make the decision to roll back completely to 1.12.1 because we just cannot
> afford to invest more time in this right now. But we'll certainly come
> back to this later. We have to...
>
> --- Original Message ---
> From: Ryan Hendrickson <ry...@gmail.com>
> Date: 17.08.2021 03:32:25
> To: users@nifi.apache.org
> Subject: Re: Disfunctional cluster with version 1.13.2
>
> > Axel,
> >    We've had significant issues with 1.13.2 in a cluster as well.  We're
> > working on a test config... Issues range from abandoned FlowFiles, single
> > nodes locking the entire cluster, load balance relationships not working,
> > and undocumented NiFi properties.  We're reluctant to move to 1.14.0
> > because we haven't seen anything specifically fixed in it.
> >
> >    Please keep the community up-to-date on your findings.
> >
> > Ryan
> >
> > On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <
> pierre.villard.fr@gmail.com>
> >
> > wrote:
> >
> > > Hi,
> > >
> > > What's the version of ZK?
> > >
> > > Thanks,
> > > Pierre
> > >
> > > On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
> > >
> > >> Dear all,
> > >>
> > >> after successfully battling the load balancing and installing version
> > >> 1.13.2 again in our 3-node production environment, we experienced
> > >> another failure in the cluster, resulting in a complete cut-off of the
> > >> flow just 1.5 hours after the update.
> > >> We noticed it only when trying to access the web interface, which
> > >> immediately after login showed something like:
> > >>
> > >> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because
> > >> the node is not connected"
> > >>
> > >> There was nothing we could do through the web interface aside from
> > >> staring at this message, and the live logs showed nothing
> > >> suspicious. The log moved on as if nothing had happened.
> > >> After a restart of the cluster everything was working fine again,
> > >> but we saw that the entire flow had not been working for some period
> > >> of time. This alone is really uncool, as we are running a cluster for
> > >> exactly that reason: the flow should keep working, even if some node
> > >> decides to malfunction for whatever reason.
> > >>
> > >> Digging a little deeper into the logs showed two noticeable problems:
> > >>
> > >> 1. ZooKeeper is restarting every few minutes, which in the log always
> > >> looks like this:
> > >>
> > >> ------------------------------------(nifiHost1.contoso.com)
> > >>
> nifi-app.log------------------------------------------------------------------------------------
> >
> > >>
> > >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
> > >> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor
> > complete
> > >> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.z.server.SyncRequestProcessor Shutting down
> > >> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > >> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >
> > >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >
> > >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181
> )(secure=disabled)]
> >
> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
> > 2000
> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
> > >> /opt/nifi/state/zookeeper/version-2
> > >> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process
> > >> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while
> > calling
> > >> watcher
> > >> java.lang.NullPointerException: null
> > >>         at
> > >>
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >
> > >>         at
> > >> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >
> > >> 2021-08-11 12:02:39,277 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > >> 2021-08-11 12:02:40,001 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >
> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > >>
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> >
> > >> Connection State changed to SUSPENDED
> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >
> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > >>
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> >
> > >> Connection State changed to SUSPENDED
> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >
> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > >>
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> >
> > >> Connection State changed to RECONNECTED
> > >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >
> > >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > >>
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> >
> > >> Connection State changed to RECONNECTED
> > >> 2021-08-11 12:02:40,014 INFO [main-EventThread]
> > >> o.a.c.framework.imps.EnsembleTracker New config event received:
> > {}
> > >> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
> > >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
> >
> > >> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
> >
> > >> 12:02:40,127; send took 951 millis
> > >> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process
> > >> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while
> > calling
> > >> watcher
> > >> java.lang.NullPointerException: null
> > >>         at
> > >>
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >
> > >>         at
> > >> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >
> > >>
> > >>
> > >>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > >>
> > >> This happens, as mentioned, every few minutes and looks like this
> > >> when condensed through grep:
> > >>
> > >> ------------------------------------(nifiHost1.contoso.com)
> nifi-app.log
> >
> > >> | grep "ZooKeeperServer Shutting
> > >> down"---------------------------------------------
> > >>
> > >> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
> > >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >>
> > >>
> > >>
> > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >>
> > >> So my first question is: Is this behaviour of zookeeper normal? Well, it
> > >> is just an INFO, but I would love to hear that this is a completely
> > >> normal phenomenon. :D
> > >>
> > >> 2. After all these zookeeper restarts, nifi finally throws the following
> > >> error along with the "last" zookeeper restart:
> > >>
> > >> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> > >>
> > >> 2021-08-11 13:31:13,083 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
> > >> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> > >> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.z.server.SyncRequestProcessor Shutting down
> > >> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> > >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
> > >> /opt/nifi/state/zookeeper/version-2
> > >> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
> > >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> > >> nifiHost2.contoso.com/192.168.1.11:19500
> > >> java.io.EOFException: Expected to receive a Transaction Completion
> > >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> > >>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> > >>         at java.lang.Thread.run(Thread.java:748)
> > >> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
> > >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> > >> nifiHost2.contoso.com/192.168.1.11:19502
> > >> java.io.EOFException: Expected to receive a Transaction Completion
> > >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> > >>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> > >>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> > >>         at java.lang.Thread.run(Thread.java:748)
> > >> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
> > >> o.a.n.c.c.node.NodeClusterCoordinator Status of
> > >> nifiHost1.contoso.com:8443 changed from
> > >> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
> > >> state=CONNECTED, updateId=16] to
> > >> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
> > >> state=CONNECTING, updateId=16]
> > >> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
> > >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> > >> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
> > >> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> > >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> > >> o.a.nifi.controller.StandardFlowService Processing reconnection request
> > >> from cluster coordinator.
> > >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> > >> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role
> > >> Primary Node disabled
> > >> 2021-08-11 13:31:13,802 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > >> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> > >> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer shutting down
> > >> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
> > >> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> > >> Repository
> > >> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> > >> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.z.server.SyncRequestProcessor Shutting down
> > >> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
> > >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > >> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
> > >> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
> > >> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
> > >> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> > >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> > >> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
> > >> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
> > >> milliseconds), max Transaction ID 27772177643
> > >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> > >> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> > >> Repository with 10 records in 8 milliseconds
> > >> 2021-08-11 13:31:28,295 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > >> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> > >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> > >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> > >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> > >> /opt/nifi/logs/zookeeper/version-2 snapdir
> > >> /opt/nifi/state/zookeeper/version-2
> > >> 2021-08-11 13:31:32,291 INFO [main-EventThread]
> > >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > >>
> > >> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >>
> > >> After that, the log continues with normal messages about heartbeats and
> > >> provenance events, but no further errors occur. That was exactly the
> > >> time the whole flow was cut off.
> > >>
> > >> I have to add that we never experienced this flow cut-off behaviour
> > >> with version 1.12.1, although the zookeeper restarts occur there as well
> > >> and as often as seen above.
> > >> But when we first updated to version 1.13.2 back in June, we experienced
> > >> this flow cut-off multiple times, alongside the broken load balancing.
> > >> Back then I thought the two were connected, but now we had fully
> > >> functional load balancing and the cut-off happened nevertheless.
> > >>
> > >> I'm thankful for any hint. :)
> > >>
> > >>
> > >>
> > >>
> >
>
>
>
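
To put a number on "restarting every few minutes", the grep output quoted above can be turned into gaps between consecutive restarts. The following is only a minimal sketch: the function names, the regular expression and the sample lines are illustrative assumptions, and it expects unwrapped log lines as they appear in nifi-app.log, not the mail-wrapped form shown in this thread.

```python
import re
from datetime import datetime

# Timestamp format as seen in the quoted log lines, e.g. "2021-08-11 12:02:39,187".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")
# Same case-sensitive marker as the grep used in the thread.
MARKER = "ZooKeeperServer Shutting down"


def shutdown_times(lines):
    """Return the timestamps of all lines containing MARKER, in input order."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Sample lines assembled from the excerpts in this thread (unwrapped by hand).
SAMPLE = [
    "2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
    "2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down",
    "./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
]

if __name__ == "__main__":
    for gap in gaps_in_seconds(shutdown_times(SAMPLE)):
        print(f"{gap:.3f} s between restarts")
```

To run this against a real log, pass an open file handle (for example `open("nifi-app.log")`) to `shutdown_times` instead of `SAMPLE`. The lowercase "shutting down" line that ZooKeeper logs right after each restart is skipped on purpose, so each restart is counted once.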

Re: Re: Disfunctional cluster with version 1.13.2

Posted by Axel Schwarz <Ax...@emailn.de>.
Hey Ryan,

that sounds awfully familiar. What we have successfully battled so far is the load balancing problem.
You can find the whole plot of this drama in the mailing list archive; the thread title is "No Load Balancing since 1.13.2".

Of course I will keep this thread updated, but unfortunately we had to make the decision to roll back completely to 1.12.1, because we just cannot afford to invest more time in this right now. But we'll certainly come back to this later. We have to...

--- Original Message ---
From: Ryan Hendrickson <ry...@gmail.com>
Date: 17.08.2021 03:32:25
To: users@nifi.apache.org
Subject: Re: Disfunctional cluster with version 1.13.2

> Axel,
>    We've had significant issues with 1.13.2 in a Cluster as well.  We're
> working on a test config... Issues range from abandoned FlowFiles, single
> Nodes locking the entire cluster, load balance relationships not working,
> and undocumented nifi properties.  We're reluctant to move to 1.14.0
> because we haven't seen anything specifically fixed in it.
>
>    Please keep the community up-to-date on your findings.
>
> Ryan
>
> On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <pi...@gmail.com> wrote:
>
> > Hi,
> >
> > What's the version of ZK?
> >
> > Thanks,
> > Pierre
> >
> > On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
> >
> >> Dear all,
> >>
> >> after successfully battling the load balancing and installing version
> >> 1.13.2 again in our 3-node production environment, we experienced another
> >> failure in the cluster, resulting in a complete cut-off of the flow just
> >> 1.5 h after the update.
> >> We noticed it only when trying to access the web interface, which
> >> immediately after login showed something like:
> >>
> >> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> >> node is not connected"
> >>
> >> There was nothing we could do through the web interface aside from staring
> >> at this message, and when looking at the live logs, there was nothing
> >> suspicious. The log moved on as if nothing had happened.
> >> After a restart of the cluster everything was working fine again, but we
> >> saw that the entire flow had not been working for some period of time. This
> >> alone is really uncool, as we are running a cluster for exactly that reason:
> >> The flow should keep working, even if some node decides to malfunction for
> >> whatever reason.
> >>
> >> Digging a little deeper into the logs showed two noticeable problems:
> >>
> >> 1. The Zookeeper is restarting every few minutes. Which in the log always
> >> looks like this:
> >>
> >> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>
> >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer shutting down
> >> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.SyncRequestProcessor Shutting down
> >> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
> >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >> /opt/nifi/logs/zookeeper/version-2 snapdir
> >> /opt/nifi/state/zookeeper/version-2
> >> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread]
> >> org.apache.zookeeper.ClientCnxn Error while calling watcher
> >> java.lang.NullPointerException: null
> >>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >> 2021-08-11 12:02:39,277 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >> 2021-08-11 12:02:40,001 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> >> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> >> Connection State changed to SUSPENDED
> >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> >> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> >> Connection State changed to SUSPENDED
> >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> >> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> >> Connection State changed to RECONNECTED
> >> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> >> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> >> Connection State changed to RECONNECTED
> >> 2021-08-11 12:02:40,014 INFO [main-EventThread]
> >> o.a.c.framework.imps.EnsembleTracker New config event received: {}
> >> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
> >> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
> >> 12:02:40,127; send took 951 millis
> >> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread]
> >> org.apache.zookeeper.ClientCnxn Error while calling watcher
> >> java.lang.NullPointerException: null
> >>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >>
> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>
> >> This happens, as mentioned, every few minutes and looks like this when
> >> condensed through grep:
> >>
> >> ------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------
> >>
> >> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
> >> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>
> >>
> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >>
> >> So my first question is: Is this behaviour of zookeeper normal? Well, it
> >> is just an INFO, but I would love to hear that this is a completely
> >> normal phenomenon. :D
> >>
> >> 2. After all these zookeeper restarts, nifi finally throws the following
> >> error along with the "last" zookeeper restart:
> >>
> >> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>
> >> 2021-08-11 13:31:13,083 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer shutting down
> >> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.SyncRequestProcessor Shutting down
> >> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
> >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >> /opt/nifi/logs/zookeeper/version-2 snapdir
> >> /opt/nifi/state/zookeeper/version-2
> >> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
> >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> >> nifiHost2.contoso.com/192.168.1.11:19500
> >> java.io.EOFException: Expected to receive a Transaction Completion
> >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>         at java.lang.Thread.run(Thread.java:748)
> >> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
> >> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> >> nifiHost2.contoso.com/192.168.1.11:19502
> >> java.io.EOFException: Expected to receive a Transaction Completion
> >> Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>         at
> >> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>         at java.lang.Thread.run(Thread.java:748)
> >> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
> >> o.a.n.c.c.node.NodeClusterCoordinator Status of
> >> nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=
> >> nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to
> >> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
> >> state=CONNECTING, updateId=16]
> >> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> >> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
> >> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> >> o.a.nifi.controller.StandardFlowService Processing reconnection request
> >> from cluster coordinator.
> >> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary
> >> Node disabled
> >> 2021-08-11 13:31:13,802 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer shutting down
> >> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
> >> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> >> Repository
> >> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.z.server.SyncRequestProcessor Shutting down
> >> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
> >> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
> >> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
> >> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
> >> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> >> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
> >> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
> >> milliseconds), max Transaction ID 27772177643
> >> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> >> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> >> Repository with 10 records in 8 milliseconds
> >> 2021-08-11 13:31:28,295 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> >> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >> /opt/nifi/logs/zookeeper/version-2 snapdir
> >> /opt/nifi/state/zookeeper/version-2
> >> 2021-08-11 13:31:32,291 INFO [main-EventThread]
> >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>
> >>
> >> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>
> >> After that, the log continues with normal messages about heartbeats and
> >> provenance events but no error occurs anymore. That exactly was the time,
> >> the whole flow cut-off.
> >>
> >> I've got to add, that we've never experienced that flow cut-off behaviour
> >> with version 1.12.1, but the zookeeper restarts occur there as well and as
> >> often as seen above.
> >> But when first updating to version 1.13.2 back in June, we experienced
> >> this flow cut-off multiple times beside the not working load balancing.
> >> Back then I thought this was connected to each other, but now we had a
> >> fully functional load balancing and the cut-off happened nevertheless.
> >>
> >> I'm thankful for any hint. :)
> >>



Re: Disfunctional cluster with version 1.13.2

Posted by Ryan Hendrickson <ry...@gmail.com>.
Axel,
   We've had significant issues with 1.13.2 in a Cluster as well.  We're
working on a test config... Issues range from abandoned FlowFiles, single
Nodes locking the entire cluster, load balance relationships not working,
and undocumented nifi properties.  We're reluctant to move to 1.14.0
because we haven't seen anything specifically fixed in it.

   Please keep the community up-to-date on your findings.

Ryan

On Mon, Aug 16, 2021 at 11:00 AM Pierre Villard <pi...@gmail.com>
wrote:

> Hi,
>
> What's the version of ZK?
>
> Thanks,
> Pierre
>
> On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
>
>> Dear all,
>>
>> after successfully battling the load balancing and installing Version
>> 1.13.2 again in our 3 node production environment, we experienced another
>> failure in the cluster resulting in a complete cut-off of the flow just
>> 1,5h after the update.
>> We noticed it just by trying to access the webinterface, which
>> immediately after login showed something like:
>>
>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
>> node is not connected"
>>
>> There was nothing we could do through the webinterface aside from staring
>> at this message and when looking at the live logs, there was nothing
>> suspicious. The log moved on as if nothing happened.
>> After a restart of the cluster everything was working fine again, but we
>> saw, that the entire flow wasn't working for some period of time. This
>> alone is really uncool, as we running a cluster for exactly that reason:
>> The flow should keep working, even if some node decides to malfunction for
>> whatever reason.
>>
>> Digging a little deeper into the logs showed two noticable problems:
>>
>> 1. The Zookeeper is restarting every few minutes. Which in the log always
>> looks like this:
>>
>> ------------------------------------(nifiHost1.contoso.com)
>> nifi-app.log------------------------------------------------------------------------------------
>>
>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer shutting down
>> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.SyncRequestProcessor Shutting down
>> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> /opt/nifi/logs/zookeeper/version-2 snapdir
>> /opt/nifi/state/zookeeper/version-2
>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process
>> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling
>> watcher
>> java.lang.NullPointerException: null
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2021-08-11 12:02:39,277 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> 2021-08-11 12:02:40,001 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>> Connection State changed to SUSPENDED
>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>> Connection State changed to SUSPENDED
>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>> Connection State changed to RECONNECTED
>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager
>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>> Connection State changed to RECONNECTED
>> 2021-08-11 12:02:40,014 INFO [main-EventThread]
>> o.a.c.framework.imps.EnsembleTracker New config event received: {}
>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
>> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
>> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
>> 12:02:40,127; send took 951 millis
>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process
>> Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling
>> watcher
>> java.lang.NullPointerException: null
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> This happens, as mentioned, every few minutes and looks like this when
>> condensed through grep:
>>
>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log
>> | grep "ZooKeeperServer Shutting
>> down"---------------------------------------------
>>
>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> So my first question is: Is this behaviour of zookeeper normal? Well it
>> is just an INFO, but I would love to here that this is a completely normal
>> phenomenon. :D
>>
>> 2. After all these zookeeper restarts, nifi finally throws the following
>> error along with the "last" zookeeper restart:
>>
>> ------------------------------------(nifiHost1.contoso.com)
>> nifi-app.log------------------------------------------------------------------------------------
>>
>> 2021-08-11 13:31:13,083 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer shutting down
>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.SyncRequestProcessor Shutting down
>> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> /opt/nifi/logs/zookeeper/version-2 snapdir
>> /opt/nifi/state/zookeeper/version-2
>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
>> nifiHost2.contoso.com/192.168.1.11:19500
>> java.io.EOFException: Expected to receive a Transaction Completion
>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>>         at java.lang.Thread.run(Thread.java:748)
>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
>> nifiHost2.contoso.com/192.168.1.11:19502
>> java.io.EOFException: Expected to receive a Transaction Completion
>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>>         at
>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>>         at java.lang.Thread.run(Thread.java:748)
>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
>> o.a.n.c.c.node.NodeClusterCoordinator Status of
>> nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=
>> nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to
>> NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
>> state=CONNECTING, updateId=16]
>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
>> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>> o.a.nifi.controller.StandardFlowService Processing reconnection request
>> from cluster coordinator.
>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary
>> Node disabled
>> 2021-08-11 13:31:13,802 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer shutting down
>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
>> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
>> Repository
>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.z.server.SyncRequestProcessor Shutting down
>> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
>> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
>> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
>> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
>> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
>> milliseconds), max Transaction ID 27772177643
>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
>> Repository with 10 records in 8 milliseconds
>> 2021-08-11 13:31:28,295 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>> /opt/nifi/logs/zookeeper/version-2 snapdir
>> /opt/nifi/state/zookeeper/version-2
>> 2021-08-11 13:31:32,291 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> After that, the log continues with normal messages about heartbeats and
>> provenance events but no error occurs anymore. That exactly was the time,
>> the whole flow cut-off.
>>
>> I've got to add, that we've never experienced that flow cut-off behaviour
>> with version 1.12.1, but the zookeeper restarts occur there as well and as
>> often as seen above.
>> But when first updating to version 1.13.2 back in June, we experienced
>> this flow cut-off multiple times beside the not working load balancing.
>> Back then I thought this was connected to each other, but now we had a
>> fully functional load balancing and the cut-off happened nevertheless.
>>
>> I'm thankful for any hint. :)
>>
>>
>>
>>

Re: Re: Re: Disfunctional cluster with version 1.13.2

Posted by Axel Schwarz <Ax...@emailn.de>.
For anyone seeing similar behaviour or problems, here is what we have done to address this so far:

We installed 1.13.2 again, as we thought we had cleared up any trouble with "external" components like ZooKeeper (we set up an external cluster as proposed by Mark) and MinIO (we stabilized the cluster in general following best practices), and everything had been running smoothly for weeks under version 1.12.1.
9 hours after the update the cluster stopped working. Login didn't work anymore and the flow did not take in any data from our servers. So we were back in the state in which I originally posted this thread.

After some research I think I found the reason for this behaviour: there seemed to be no elected Primary Node or Cluster Coordinator anymore. We still had the issues with ZooKeeper described in the original post, despite already relying on an external cluster. ZooKeeper no longer restarts every few minutes, but the NiFi nodes changed to state SUSPENDED and back to RECONNECTED roughly every half hour. As mentioned in the original post, this also happened under version 1.12.1, but it never ended in a dysfunctional cluster.

So, as this seems to be an issue with leader election and/or ZooKeeper connection stability, I started to further stabilize the ZooKeeper cluster by digging deeper into best practices, and did the following:

- Increased RAM from 2GB to 4GB per node
- Limited the Java heap size to 2GB for ZooKeeper (created the file "/[zookeeper_home]/conf/java.env" with the content "export JVMFLAGS="-Xmx2048m"")
- Set swappiness to 0 so that Linux avoids swapping as far as possible, as swapping would be "deadly" for ZooKeeper (added the line "vm.swappiness=0" to /etc/sysctl.conf)
- Increased the ZooKeeper timeouts in the NiFi config from 10s to 30s (nifi.zookeeper.connect.timeout and nifi.zookeeper.session.timeout)
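
A minimal sketch of how the last item could be double-checked, assuming Python 3 is available on the node: it takes the text of nifi.properties and flags the two ZooKeeper timeouts if they are missing or below a threshold. The helper names, the 30-second default and the small unit table are assumptions for illustration, not part of NiFi.

```python
import re

# Rough unit table for NiFi-style durations such as "30 secs"; only a few
# common units are handled here (an assumption, not the full NiFi parser).
UNIT_SECONDS = {
    "ms": 0.001, "millis": 0.001,
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
}

# The two properties named in the list above.
ZK_TIMEOUT_KEYS = (
    "nifi.zookeeper.connect.timeout",
    "nifi.zookeeper.session.timeout",
)


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def to_seconds(value):
    """Convert a duration like '30 secs' to seconds; None if not understood."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", value.strip())
    if not match or match.group(2).lower() not in UNIT_SECONDS:
        return None
    return int(match.group(1)) * UNIT_SECONDS[match.group(2).lower()]


def short_zk_timeouts(text, minimum_seconds=30):
    """Return the timeout keys that are missing, unparsable or below the minimum."""
    props = parse_properties(text)
    flagged = []
    for key in ZK_TIMEOUT_KEYS:
        seconds = to_seconds(props.get(key, ""))
        if seconds is None or seconds < minimum_seconds:
            flagged.append(key)
    return flagged
```

For example, calling short_zk_timeouts() with the contents of conf/nifi.properties (the path is an assumption about the install layout) should return an empty list once both timeouts are set to "30 secs".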

So far (around 20h running) there are no more SUSPENDED/RECONNECTED messages in the NiFi log and the cluster is running pretty nicely. My personal feeling is that increasing these timeouts did the trick, but of course that's just a feeling, as I cannot prove it.
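
To keep an eye on this, the SUSPENDED/RECONNECTED pairs can be pulled out of nifi-app.log with a few lines of Python. This is only a sketch: it assumes the log line layout shown in the original post (timestamp first, then the Curator ConnectionStateManager message, all on one line), and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"ConnectionStateManager State change: (SUSPENDED|RECONNECTED)"
)


def suspension_windows(lines):
    """Yield (suspended_at, reconnected_at, seconds) for each SUSPENDED -> RECONNECTED pair."""
    suspended_at = None
    for line in lines:
        match = STATE_CHANGE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        state = match.group(2)
        if state == "SUSPENDED" and suspended_at is None:
            # Remember when the ZooKeeper connection was first suspended.
            suspended_at = stamp
        elif state == "RECONNECTED" and suspended_at is not None:
            yield suspended_at, stamp, (stamp - suspended_at).total_seconds()
            suspended_at = None
```

Feeding it an open log file (for example, iterating over open("nifi-app.log")) gives one tuple per interruption, so an empty result over a day of logs would match what we are seeing now.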

I hope this will last and I'll keep you posted. :)

--- Original Message ---
From: "Axel Schwarz" <Ax...@emailn.de>
Date: 27.08.2021 11:07:13
To: users@nifi.apache.org
Subject: Re: Re: Disfunctional cluster with version 1.13.2

> Hey Mark,
>
> thanks again for your help. We just installed a dedicated ZooKeeper cluster.
> I will report our experiences regarding this topic as soon as we have regained
> the courage to update to version 1.13.2 again :D
>
> --- Original Message ---
> From: Mark Payne <ma...@hotmail.com>
> Date: 23.08.2021 15:50:35
> To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
> Subject: Re: Disfunctional cluster with version 1.13.2
>
> > Hey Axel,
> >
> > Yes - it is definitely advisable to run zookeeper external to NiFi, on
> > another server/VM. In fact - I would not recommend running any other
> > service co-located with NiFi. NiFi is really designed to be quite greedy
> > and make use of whatever hardware you allow it to. This can be a problem
> > specifically with ZooKeeper. If a NiFi node gets really busy (due to a
> > sudden burst of traffic, for example), it can easily starve Zookeeper of
> > threads. This can happen due to over-utilization of the CPU, or it can
> > happen due to a lot of garbage collection, etc. Typically this is a very
> > temporary thing and NiFi happily recovers once the load has been worked
> > down.
> >
> > But when this happens, if using an embedded ZK, all of the nodes in the
> > NiFi cluster would start having trouble communicating with ZooKeeper. As a
> > result, they will all start disconnecting and reconnecting. This, in turn,
> > puts greater stress on the node that is already falling behind. And so
> > begins a cycle that can lead to cluster instability. Running an external
> > ZooKeeper on the same node is better than embedded because at least it
> > doesn’t share the same JVM / heap / garbage collection, etc. But
> > contention with disks & cpu can still cause the same issues. So running it
> > elsewhere is definitely recommended.
> >
> > In terms of how many zookeeper servers to use - it really depends on what
> > you need. A single Zookeeper instance is enough to get up & running but
> > becomes a single point of failure, so typically 3 nodes are recommended
> > for ZooKeeper (though the nodes don’t need to be particularly powerful).
> > I don’t think adding any additional nodes provides any benefit, beyond
> > making it more resilient if additional ZK nodes fail.
> >
> > Thanks
> > -Mark
> >
> > > On Aug 23, 2021, at 4:30 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> > >
> > > Hey Mark,
> > >
> > > thanks for your answer. We always had an external zookeeper running,
> > > but on the same hosts as the nifi instances. But as we were researching through
> > > the load balancing problem in version 1.13.2, we decided to use the embedded
> > > zookeeper, because in the documentation there is no hint at all to use an
> > > external zookeeper instance in production environment. So we just thought
> > > we should get rid of that possible source of error. Like incompatible versions
> > > etc.
> > >
> > > So just to be clear: It's officially advisable to run some external
> > > zookeeper instance on extra servers? If so, which version should we use?
> > > How many servers should a zookeeper cluster have to manage 3-5 Nifi-Hosts?
> > >
> > > Gotta say, I really appreciate all the work this community puts into
> > > helping others. I hope I can give something back some time.
> > >
> > > --- Original Message ---
> > > From: Mark Payne <ma...@hotmail.com>
> > > Date: 17.08.2021 16:13:29
> > > To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
> > > Subject: Re: Disfunctional cluster with version 1.13.2
> > >
> > >> Axel,
> > >>
> > >> So a few things to consider here.
> > >>
> > >> The NullPointer that you’re showing there in the logs was fixed in 1.14.0
> > >> [1].
> > >>
> > >> Secondly, is a question of why the ZooKeeper client is disconnecting. Given
> > >> that you’re running an embedded zookeeper server, it could be for any number
> > >> of reasons - low java heap, too high CPU usage, over-utilized disks, etc.
> > >> The embedded zookeeper is a convenience that should be used for testing &
> > >> playing around locally but should never be used for production use - you
> > >> should always depend on an external zookeeper for any production use case,
> > >> running on a different machine (or set of machines). I suspect moving to
> > >> an external zookeeper alone will be enough to help significantly here.
> > >>
> > >> But if you run into issues after moving to an external zookeeper, would recommend
> > >> running “bin/nifi.sh diagnostics diag1.txt” when you’re having trouble. This
> > >> will gather all sorts of information that you should be able to share, including
> > >> garbage collection information, how many times the cluster coordinator is
> > >> changing, etc. that can be helpful in diagnosing cluster-related issues.
> > >>
> > >>
> > >> Thanks
> > >> -Mark
> > >>
> > >> [1] https://issues.apache.org/jira/browse/NIFI-8394
> > >>
> > >>
> > >>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> > >>>
> > >>> Hey Pierre,
> > >>>
> > >>> we're currently using the embedded ZK. So as far as I know this is Version
> > >>> 3.5.6
> > >>>
> > >>> --- Original Message ---
> > >>> From: Pierre Villard <pi...@gmail.com>
> > >>> Date: 16.08.2021 16:59:22
> > >>> To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
> > >>> Subject: Re: Disfunctional cluster with version 1.13.2
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> What's the version of ZK?
> > >>>>
> > >>>> Thanks,
> > >>>> Pierre



Re: Re: Disfunctional cluster with version 1.13.2

Posted by Axel Schwarz <Ax...@emailn.de>.
Hey Mark,

thanks again for your help. We just installed a dedicated ZooKeeper cluster. I will report our experiences regarding this topic as soon as we have regained the courage to update to version 1.13.2 again :D

--- Original Message ---
From: Mark Payne <ma...@hotmail.com>
Date: 23.08.2021 15:50:35
To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
Subject: Re: Disfunctional cluster with version 1.13.2

> Hey Axel,
>
> Yes - it is definitely advisable to run zookeeper external to NiFi, on another
> server/VM. In fact - I would not recommend running any other service co-located
> with NiFi. NiFi is really designed to be quite greedy and make use of whatever
> hardware you allow it to. This can be a problem specifically with ZooKeeper.
> If a NiFi node gets really busy (due to a sudden burst of traffic, for example),
> it can easily starve Zookeeper of threads. This can happen due to over-utilization
> of the CPU, or it can happen due to a lot of garbage collection, etc. Typically
> this is a very temporary thing and NiFi happily recovers once the load has
> been worked down.
>
> But when this happens, if using an embedded ZK, all of the nodes in the NiFi
> cluster would start having trouble communicating with ZooKeeper. As a result,
> they will all start disconnecting and reconnecting. This, in turn, puts greater
> stress on the node that is already falling behind. And so begins a cycle
> that can lead to cluster instability. Running an external ZooKeeper on the
> same node is better than embedded because at least it doesn’t share the same
> JVM / heap / garbage collection, etc. But contention with disks & cpu
> can still cause the same issues. So running it elsewhere is definitely recommended.
>
>
> In terms of how many zookeeper servers to use - it really depends on what
> you need. A single Zookeeper instance is enough to get up & running but
> becomes a single point of failure, so typically 3 nodes are recommended for
> ZooKeeper (though the nodes don’t need to be particularly powerful). I don’t
> think adding any additional nodes provides any benefit, beyond making it
> more resilient if additional ZK nodes fail.
>
> Thanks
> -Mark
>
> > On Aug 23, 2021, at 4:30 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> >
> > Hey Mark,
> >
> > thanks for your answer. We always had an external zookeeper running, but on the same hosts as the nifi instances. But as we were researching through the load balancing problem in version 1.13.2, we decided to use the embedded zookeeper, because in the documentation there is no hint at all to use an external zookeeper instance in production environment. So we just thought we should get rid of that possible source of error. Like incompatible versions etc.
> >
> > So just to be clear: It's officially advisable to run some external zookeeper instance on extra servers? If so, which version should we use? How many servers should a zookeeper cluster have to manage 3-5 Nifi-Hosts?
> >
> > Gotta say, I really appreciate all the work this community puts into helping others. I hope I can give something back some time.
> >
> > --- Original Message ---
> > From: Mark Payne <ma...@hotmail.com>
> > Date: 17.08.2021 16:13:29
> > To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
> > Subject: Re: Disfunctional cluster with version 1.13.2
> >
> >> Axel,
> >>
> >> So a few things to consider here.
> >>
> >> The NullPointer that you’re showing there in the logs was fixed in 1.14.0 [1].
> >>
> >> Secondly, is a question of why the ZooKeeper client is disconnecting. Given that you’re running an embedded zookeeper server, it could be for any number of reasons - low java heap, too high CPU usage, over-utilized disks, etc. The embedded zookeeper is a convenience that should be used for testing & playing around locally but should never be used for production use - you should always depend on an external zookeeper for any production use case, running on a different machine (or set of machines). I suspect moving to an external zookeeper alone will be enough to help significantly here.
> >>
> >> But if you run into issues after moving to an external zookeeper, would recommend running “bin/nifi.sh diagnostics diag1.txt” when you’re having trouble. This will gather all sorts of information that you should be able to share, including garbage collection information, how many times the cluster coordinator is changing, etc. that can be helpful in diagnosing cluster-related issues.
> >>
> >> Thanks
> >> -Mark
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-8394
> >>
> >>
> >>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> >>>
> >>> Hey Pierre,
> >>>
> >>> we're currently using the embedded ZK. So as far as I know this is Version 3.5.6
> >>>
> >>> --- Original Message ---
> >>> From: Pierre Villard <pi...@gmail.com>
> >>> Date: 16.08.2021 16:59:22
> >>> To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
> >>> Subject: Re: Disfunctional cluster with version 1.13.2
> >>>
> >>>> Hi,
> >>>>
> >>>> What's the version of ZK?
> >>>>
> >>>> Thanks,
> >>>> Pierre



Re: Disfunctional cluster with version 1.13.2

Posted by Mark Payne <ma...@hotmail.com>.
Hey Axel,

Yes - it is definitely advisable to run zookeeper external to NiFi, on another server/VM. In fact - I would not recommend running any other service co-located with NiFi. NiFi is really designed to be quite greedy and make use of whatever hardware you allow it to. This can be a problem specifically with ZooKeeper. If a NiFi node gets really busy (due to a sudden burst of traffic, for example), it can easily starve Zookeeper of threads. This can happen due to over-utilization of the CPU, or it can happen due to a lot of garbage collection, etc. Typically this is a very temporary thing and NiFi happily recovers once the load has been worked down.
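
One rough way to see whether ZooKeeper is being starved like this is to measure how often the embedded server restarts. The following is only a sketch: it assumes the nifi-app.log line layout shown in the original post, and the function names are made up for this illustration.

```python
import re
from datetime import datetime

# Matches the line the embedded server logs at the start of each restart.
# The match is case-sensitive on purpose: the log also contains a second
# "shutting down" line (lowercase) for the same restart, which is skipped.
RESTART = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO .*ZooKeeperServer Shutting down"
)


def restart_times(lines):
    """Return the timestamp of every 'ZooKeeperServer Shutting down' entry."""
    times = []
    for line in lines:
        match = RESTART.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Seconds between consecutive restarts; many short gaps hint at trouble."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Two restart lines copied from the log excerpt in the original post.
    sample = [
        "2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
        "o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
        "o.a.zookeeper.server.ZooKeeperServer Shutting down",
    ]
    print(gaps_in_seconds(restart_times(sample)))
```

To run it against a real log, pass the open file object instead of the sample list.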

But when this happens with an embedded ZK, all of the nodes in the NiFi cluster start having trouble communicating with ZooKeeper. As a result, they all start disconnecting and reconnecting. This, in turn, puts greater stress on the node that is already falling behind. And so begins a cycle that can lead to cluster instability. Running an external ZooKeeper on the same node is better than embedded because at least it doesn’t share the same JVM / heap / garbage collection, etc. But contention for disks and CPU can still cause the same issues. So running it elsewhere is definitely recommended.
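
A rough way to see how often a node goes through this cycle is to pull the "ZooKeeperServer Shutting down" lines out of nifi-app.log and look at the gaps between them. The following is only a minimal Python sketch: it assumes the timestamp layout shown in the log excerpts earlier in this thread, and the helper name restart_gaps is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp as it appears in nifi-app.log, e.g. "2021-08-11 12:02:39,187".
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")
# Only the capital-"S" message is matched, so each restart is counted once.
MARKER = "ZooKeeperServer Shutting down"


def restart_gaps(lines):
    """Return the seconds between consecutive ZooKeeper shutdown log lines."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Three lines taken from the log excerpt in this thread (unwrapped).
sample = [
    "2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
    "2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down",
    "2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
]
print(restart_gaps(sample))  # [43.0]
```

For a real log, something like restart_gaps(open("nifi-app.log")) should work the same way; gaps of well under a minute, as in the sample, would fit the disconnect/reconnect cycle described above.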

In terms of how many ZooKeeper servers to use - it really depends on what you need. A single ZooKeeper instance is enough to get up and running but becomes a single point of failure, so typically 3 nodes are recommended for ZooKeeper (though the nodes don’t need to be particularly powerful). I don’t think adding more nodes provides any benefit beyond making the ensemble more resilient if additional ZK nodes fail.
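
The sizing trade-off can be illustrated with a little arithmetic. ZooKeeper needs a strict majority of its servers to be up in order to keep serving, so the number of failures an ensemble survives follows directly from its size. This is just a sketch of that rule in Python; the function names are made up for illustration.

```python
def quorum_size(servers: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can fail while a majority is still available."""
    return servers - quorum_size(servers)


for n in (1, 3, 4, 5):
    print(f"{n} server(s): quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This prints a tolerance of 0 failures for 1 server, 1 for 3 servers, 1 for 4 servers and 2 for 5 servers, which is why ensembles are usually sized with odd numbers and why 3 is the usual starting point.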

Thanks
-Mark

> On Aug 23, 2021, at 4:30 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> 
> Hey Mark,
> 
> thanks for your answer. We always had an external ZooKeeper running, but on the same hosts as the NiFi instances. But while investigating the load balancing problem in version 1.13.2, we decided to use the embedded ZooKeeper, because the documentation gives no hint at all that an external ZooKeeper instance should be used in a production environment. So we just thought we should get rid of that possible source of error, such as incompatible versions.
> 
> So just to be clear: it's officially advisable to run an external ZooKeeper instance on separate servers? If so, which version should we use? How many servers should a ZooKeeper cluster have to manage 3-5 NiFi hosts?
> 
> Gotta say, I really appreciate all the work this community puts into helping others. I hope I can give something back some time.
> 
> --- Original Message ---
> From: Mark Payne <ma...@hotmail.com>
> Date: 17.08.2021 16:13:29
> To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
> Subject: Re: Disfunctional cluster with version 1.13.2
> 
>> Axel,
>> 
>> So a few things to consider here.
>> 
>> The NullPointer that you’re showing there in the logs was fixed in 1.14.0
>> [1].
>> 
>> Secondly, there is the question of why the ZooKeeper client is disconnecting. Given
>> that you’re running an embedded ZooKeeper server, it could be for any number
>> of reasons - low Java heap, too high CPU usage, over-utilized disks, etc.
>> The embedded ZooKeeper is a convenience that should be used for testing &
>> playing around locally but should never be used in production - you
>> should always depend on an external ZooKeeper for any production use case,
>> running on a different machine (or set of machines). I suspect moving to
>> an external ZooKeeper alone will be enough to help significantly here.
>> 
>> But if you run into issues after moving to an external ZooKeeper, I would recommend
>> running “bin/nifi.sh diagnostics diag1.txt” when you’re having trouble. This
>> will gather all sorts of information that you should be able to share, including
>> garbage collection information, how many times the cluster coordinator is
>> changing, etc. that can be helpful in diagnosing cluster-related issues.
>> 
>> 
>> Thanks
>> -Mark
>> 
>> [1] https://issues.apache.org/jira/browse/NIFI-8394
>> 
>> 
>>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <Ax...@emailn.de> wrote:
>>> 
>>> Hey Pierre,
>>> 
>>> we're currently using the embedded ZK. So as far as I know this is version 3.5.6.
>>> 
>>> --- Original Message ---
>>> From: Pierre Villard <pi...@gmail.com>
>>> Date: 16.08.2021 16:59:22
>>> To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
>>> Subject: Re: Disfunctional cluster with version 1.13.2
>>> 
>>>> Hi,
>>>> 
>>>> What's the version of ZK?
>>>> 
>>>> Thanks,
>>>> Pierre
>>>> 


Re: Re: Disfunctional cluster with version 1.13.2

Posted by Axel Schwarz <Ax...@emailn.de>.
Hey Mark,

thanks for your answer. We always had an external ZooKeeper running, but on the same hosts as the NiFi instances. But while investigating the load balancing problem in version 1.13.2, we decided to use the embedded ZooKeeper, because the documentation gives no hint at all that an external ZooKeeper instance should be used in a production environment. So we just thought we should get rid of that possible source of error, such as incompatible versions.

So just to be clear: it's officially advisable to run an external ZooKeeper instance on separate servers? If so, which version should we use? How many servers should a ZooKeeper cluster have to manage 3-5 NiFi hosts?

Gotta say, I really appreciate all the work this community puts into helping others. I hope I can give something back some time.

--- Original Message ---
From: Mark Payne <ma...@hotmail.com>
Date: 17.08.2021 16:13:29
To: "users@nifi.apache.org" <us...@nifi.apache.org>, Axel Schwarz <Ax...@emailn.de>
Subject: Re: Disfunctional cluster with version 1.13.2

> Axel,
>
> So a few things to consider here.
>
> The NullPointer that you’re showing there in the logs was fixed in 1.14.0
> [1].
>
> Secondly, there is the question of why the ZooKeeper client is disconnecting. Given
> that you’re running an embedded ZooKeeper server, it could be for any number
> of reasons - low Java heap, too high CPU usage, over-utilized disks, etc.
> The embedded ZooKeeper is a convenience that should be used for testing &
> playing around locally but should never be used in production - you
> should always depend on an external ZooKeeper for any production use case,
> running on a different machine (or set of machines). I suspect moving to
> an external ZooKeeper alone will be enough to help significantly here.
>
> But if you run into issues after moving to an external ZooKeeper, I would recommend
> running “bin/nifi.sh diagnostics diag1.txt” when you’re having trouble. This
> will gather all sorts of information that you should be able to share, including
> garbage collection information, how many times the cluster coordinator is
> changing, etc. that can be helpful in diagnosing cluster-related issues.
>
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-8394
>
>
> > On Aug 17, 2021, at 1:18 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> >
> > Hey Pierre,
> >
> > we're currently using the embedded ZK. So as far as I know this is version 3.5.6.
> >
> > --- Original Message ---
> > From: Pierre Villard <pi...@gmail.com>
> > Date: 16.08.2021 16:59:22
> > To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
> > Subject: Re: Disfunctional cluster with version 1.13.2
> >
> >> Hi,
> >>
> >> What's the version of ZK?
> >>
> >> Thanks,
> >> Pierre
> >>
> >> On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
> >>
> >>> Dear all,
> >>>
> >>> after successfully battling the load balancing and installing version
> >>> 1.13.2 again in our 3-node production environment, we experienced another
> >>> failure in the cluster, resulting in a complete cut-off of the flow just
> >>> 1.5 h after the update.
> >>> We noticed it only when trying to access the web interface, which
> >>> immediately after login showed something like:
> >>>
> >>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> >>> node is not connected"
> >>>
> >>> There was nothing we could do through the web interface aside from staring
> >>> at this message, and when looking at the live logs, there was nothing
> >>> suspicious. The log moved on as if nothing had happened.
> >>> After a restart of the cluster everything was working fine again, but we
> >>> saw that the entire flow had not been working for some period of time. This
> >>> alone is really uncool, as we are running a cluster for exactly that reason:
> >>> the flow should keep working even if some node decides to malfunction for
> >>> whatever reason.
> >>>
> >>> Digging a little deeper into the logs showed two noticeable problems:
> >>>
> >>> 1. ZooKeeper is restarting every few minutes, which in the log always
> >>> looks like this:
> >>>
> >>> ------------------------------------(nifiHost1.contoso.com)
>
> >>> nifi-app.log------------------------------------------------------------------------------------
>
> >>
> >>>
> >>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer shutting down
> >>> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.FinalRequestProcessor shutdown of request processor
> complete
> >>
> >>> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.SyncRequestProcessor Shutting down
> >>> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
> >>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>
> >>> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to
> 4000
> >>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to
> 40000
> >>
> >>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
> 2000
> >>
> >>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >>> /opt/nifi/logs/zookeeper/version-2 snapdir
> >>> /opt/nifi/state/zookeeper/version-2
> >>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread]
>
> >>
> >>> org.apache.zookeeper.ClientCnxn Error while calling watcher
>
> >>> java.lang.NullPointerException: null
> >>>        at
> >>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>
> >>
> >>>        at
> >>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>
> >>
> >>> 2021-08-11 12:02:39,277 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>
> >>> 2021-08-11 12:02:40,001 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>
> >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>
> >>> o.a.n.c.l.e.CuratorLeaderElectionManager
> >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>
> >>
> >>> Connection State changed to SUSPENDED
> >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>
> >>> o.a.n.c.l.e.CuratorLeaderElectionManager
> >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>
> >>
> >>> Connection State changed to SUSPENDED
> >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>
> >>> o.a.n.c.l.e.CuratorLeaderElectionManager
> >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
>
> >>
> >>> Connection State changed to RECONNECTED
> >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
>
> >>> o.a.n.c.l.e.CuratorLeaderElectionManager
> >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
>
> >>
> >>> Connection State changed to RECONNECTED
> >>> 2021-08-11 12:02:40,014 INFO [main-EventThread]
> >>> o.a.c.framework.imps.EnsembleTracker New config event received:
> {}
> >>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
> >>> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
>
> >>
> >>> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
>
> >>> 12:02:40,127; send took 951 millis
> >>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread]
>
> >>
> >>> org.apache.zookeeper.ClientCnxn Error while calling watcher
>
> >>> java.lang.NullPointerException: null
> >>>        at
> >>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>
> >>
> >>>        at
> >>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>
> >>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >>
> >>>
> >>> This happens, as mentioned, every few minutes and looks like
> this when
> >>
> >>> condensed through grep:
> >>>
> >>> ------------------------------------(nifiHost1.contoso.com)
> nifi-app.log
> >>
> >>> | grep "ZooKeeperServer Shutting
> >>> down"---------------------------------------------
> >>>
> >>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
> >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>
> >>>
> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >>
> >>>
> >>> So my first question is: Is this behaviour of zookeeper normal?
> Well
> >> it is
> >>> just an INFO, but I would love to here that this is a completely
> normal
> >>
> >>> phenomenon. :D
> >>>
> >>> 2. After all these zookeeper restarts, nifi finally throws the
> following
> >>
> >>> error along with the "last" zookeeper restart:
> >>>
> >>> ------------------------------------(nifiHost1.contoso.com)
>
> >>> nifi-app.log------------------------------------------------------------------------------------
>
> >>
> >>>
> >>> 2021-08-11 13:31:13,083 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>
> >>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer shutting down
> >>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.FinalRequestProcessor shutdown of request processor
> complete
> >>
> >>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.SyncRequestProcessor Shutting down
> >>> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
> >>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>
> >>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to
> 4000
> >>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to
> 40000
> >>
> >>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
> 2000
> >>
> >>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >>> /opt/nifi/logs/zookeeper/version-2 snapdir
> >>> /opt/nifi/state/zookeeper/version-2
> >>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
>
> >>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate
> with
> >> Peer
> >>> nifiHost2.contoso.com/192.168.1.11:19500
> >>> java.io.EOFException: Expected to receive a Transaction Completion
>
> >>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>
> >>
> >>>        at java.lang.Thread.run(Thread.java:748)
> >>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
>
> >>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate
> with
> >> Peer
> >>> nifiHost2.contoso.com/192.168.1.11:19502
> >>> java.io.EOFException: Expected to receive a Transaction Completion
>
> >>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>
> >>
> >>>        at
> >>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>
> >>
> >>>        at java.lang.Thread.run(Thread.java:748)
> >>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
>
> >>> o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443
>
> >>
> >>> changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
>
> >>
> >>> state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=
>
> >>> nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
> >>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
>
> >>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>
> >>> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
>
> >>> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142
> millis
> >>
> >>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> >>> o.a.nifi.controller.StandardFlowService Processing reconnection
> request
> >>
> >>> from cluster coordinator.
> >>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> >>> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for
> Role
> >> Primary
> >>> Node disabled
> >>> 2021-08-11 13:31:13,802 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>
> >>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer shutting down
> >>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
> >>> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint
> of FlowFile
> >>
> >>> Repository
> >>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.FinalRequestProcessor shutdown of request processor
> complete
> >>
> >>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.z.server.SyncRequestProcessor Shutting down
> >>> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
> >>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>
> >>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
>
> >>> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over
> Event
> >>
> >>> Writer for Provenance Event Store Partition[directory=/opt/pr]
> due to
> >>
> >>> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
>
> >>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> >>> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead
> Log
> >> with
> >>> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world
> time =
> >> 1
> >>> milliseconds), max Transaction ID 27772177643
> >>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> >>> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed
> FlowFile
> >>
> >>> Repository with 10 records in 8 milliseconds
> >>> 2021-08-11 13:31:28,295 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>
> >>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to
> 4000
> >>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to
> 40000
> >>
> >>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> >>
> >>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime
> 2000
> >>
> >>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> >>> /opt/nifi/logs/zookeeper/version-2 snapdir
> >>> /opt/nifi/state/zookeeper/version-2
> >>> 2021-08-11 13:31:32,291 INFO [main-EventThread]
> >>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >>
> >>>
> >>> After that, the log continues with normal messages about heartbeats
>
> >> and
> >>> provenance events but no error occurs anymore. That exactly
> was the
> >> time,
> >>> the whole flow cut-off.
> >>>
> >>> I've got to add, that we've never experienced that flow cut-off
> behaviour
> >>
> >>> with version 1.12.1, but the zookeeper restarts occur there
> as well
> >> and as
> >>> often as seen above.
> >>> But when first updating to version 1.13.2 back in June, we experienced
>
> >>
> >>> this flow cut-off multiple times beside the not working load
> balancing.
> >>
> >>> Back then I thought this was connected to each other, but now
> we had
> >> a
> >>> fully functional load balancing and the cut-off happened nevertheless.
>
> >>
> >>>
> >>> I'm thankful for any hint. :)
> >>>
> >>>
> >>>
> >>>
> >>
> >
> >
>



Re: Disfunctional cluster with version 1.13.2

Posted by Mark Payne <ma...@hotmail.com>.
Axel,

So a few things to consider here.

The NullPointerException that you’re showing in the logs was fixed in 1.14.0 [1].

Second is the question of why the ZooKeeper client is disconnecting. Given that you’re running an embedded ZooKeeper server, it could be for any number of reasons: low Java heap, high CPU usage, over-utilized disks, etc. The embedded ZooKeeper is a convenience for testing and experimenting locally and should never be used in production. For any production use case you should depend on an external ZooKeeper running on a different machine (or set of machines). I suspect that moving to an external ZooKeeper alone will help significantly here.
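
As a rough illustration of what pointing NiFi at an external ZooKeeper can look like, here is a minimal sketch of the relevant entries in conf/nifi.properties. The hostnames are placeholders and not taken from this thread, and the comments are explanatory only. The same connect string would typically also need to be set as the "Connect String" of the ZooKeeper-based cluster state provider in conf/state-management.xml.

```properties
# conf/nifi.properties (sketch; zk1/zk2/zk3.example.com are hypothetical hosts)
# Do not start the ZooKeeper server that is embedded in NiFi.
nifi.state.management.embedded.zookeeper.start=false
# Comma-separated host:port list of the external ZooKeeper ensemble.
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```

These values have to be identical on every node, and each node needs a restart before the change takes effect.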

But if you run into issues after moving to an external ZooKeeper, I would recommend running “bin/nifi.sh diagnostics diag1.txt” while you’re having trouble. This gathers all sorts of information that you should be able to share, including garbage collection information and how many times the cluster coordinator is changing, which can be helpful in diagnosing cluster-related issues.
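
To put a number on how often the ZooKeeper server restarts, a small script over nifi-app.log may also help. The following Python sketch looks for the "ZooKeeperServer Shutting down" message quoted in this thread and reports the time between consecutive occurrences. The function names are made up for illustration, and it assumes that each log entry sits on a single line and uses the timestamp format shown in the log excerpts.

```python
import re
from datetime import datetime

# Message logged when the ZooKeeper server shuts down (capital "S", as in the
# excerpts; the lowercase "shutting down" duplicate line is deliberately ignored).
MARKER = "ZooKeeperServer Shutting down"

# Timestamp as it appears in nifi-app.log, e.g. "2021-08-11 12:02:39,187".
# search() is used so that a grep-style "file:line:" prefix is tolerated.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def shutdown_times(lines):
    """Return the timestamp of every line that contains MARKER."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between consecutive timestamps."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Sample lines copied from the grep output quoted in this thread.
    sample = [
        "./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO "
        "[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
        "o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO "
        "[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] "
        "o.a.zookeeper.server.ZooKeeperServer Shutting down",
    ]
    times = shutdown_times(sample)
    print(len(times), "shutdowns, gaps in seconds:", gaps_in_seconds(times))
```

To run it against a real log, pass an open file object to shutdown_times instead of the sample list, since iterating over a file yields one line at a time. Comparing the gaps before and after the move to an external ZooKeeper should show whether the restarts have stopped.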

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-8394


> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <Ax...@emailn.de> wrote:
> 
> Hey Pierre,
> 
> we're currently using the embedded ZK. So as far as I know this is Version 3.5.6
> 
> --- Original Message ---
> From: Pierre Villard <pi...@gmail.com>
> Date: 16.08.2021 16:59:22
> To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
> Subject: Re: Disfunctional cluster with version 1.13.2
> 
>> Hi,
>> 
>> What's the version of ZK?
>> 
>> Thanks,
>> Pierre
>> 
>> On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
>> 
>>> Dear all,
>>>
>>> after successfully battling the load balancing and installing version
>>> 1.13.2 again in our 3-node production environment, we experienced another
>>> failure in the cluster, resulting in a complete cut-off of the flow just
>>> 1.5 hours after the update.
>>> We noticed it only when trying to access the web interface, which immediately
>>> after login showed something like:
>>>
>>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
>>> node is not connected"
>>>
>>> There was nothing we could do through the web interface aside from staring
>>> at this message, and when looking at the live logs, there was nothing
>>> suspicious. The log moved on as if nothing had happened.
>>> After a restart of the cluster everything was working fine again, but we
>>> saw that the entire flow had not been working for some period of time. This
>>> alone is really uncool, as we are running a cluster for exactly that reason:
>>> the flow should keep working even if some node decides to malfunction for
>>> whatever reason.
>>>
>>> Digging a little deeper into the logs showed two noticeable problems:
>>>
>>> 1. ZooKeeper is restarting every few minutes, which in the log always
>>> looks like this:
>>> 
>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
>>>
>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>>> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>>> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
>>> java.lang.NullPointerException: null
>>>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>>>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>> 2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
>>> 2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
>>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
>>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
>>> java.lang.NullPointerException: null
>>>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>>>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> This happens, as mentioned, every few minutes and looks like this when
>>> condensed through grep:
>>>
>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------
>>> 
>>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>> 
>>> So my first question is: Is this behaviour of ZooKeeper normal? Well, it is
>>> just an INFO, but I would love to hear that this is a completely normal
>>> phenomenon. :D
>>> 
>>> 2. After all these ZooKeeper restarts, NiFi finally throws the following
>>> error along with the "last" ZooKeeper restart:
>>> 
>>> ------------------------------------(nifiHost1.contoso.com)
>>> nifi-app.log------------------------------------------------------------------------------------
>> 
>>> 
>>> 2021-08-11 13:31:13,083 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>> 
>>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
>>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> 
>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>> 
>>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>>> /opt/nifi/logs/zookeeper/version-2 snapdir
>>> /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
>>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with
>> Peer
>>> nifiHost2.contoso.com/192.168.1.11:19500
>>> java.io.EOFException: Expected to receive a Transaction Completion
>>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>> 
>>>        at java.lang.Thread.run(Thread.java:748)
>>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
>>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with
>> Peer
>>> nifiHost2.contoso.com/192.168.1.11:19502
>>> java.io.EOFException: Expected to receive a Transaction Completion
>>> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>> 
>>>        at
>>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>> 
>>>        at java.lang.Thread.run(Thread.java:748)
>>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
>>> o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443
>> 
>>> changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
>> 
>>> state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=
>>> nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
>>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
>>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>>> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
>>> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
>> 
>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>>> o.a.nifi.controller.StandardFlowService Processing reconnection request
>> 
>>> from cluster coordinator.
>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
>>> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role
>> Primary
>>> Node disabled
>>> 2021-08-11 13:31:13,802 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
>>> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
>> 
>>> Repository
>>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>> 
>>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
>>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
>>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
>>> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
>> 
>>> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
>> 
>>> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>>> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log
>> with
>>> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time =
>> 1
>>> milliseconds), max Transaction ID 27772177643
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
>>> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
>> 
>>> Repository with 10 records in 8 milliseconds
>>> 2021-08-11 13:31:28,295 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>> 
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>> 
>>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>> 
>>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
>>> /opt/nifi/logs/zookeeper/version-2 snapdir
>>> /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 13:31:32,291 INFO [main-EventThread]
>>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> 
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>> 
>>> After that, the log continues with normal messages about heartbeats and
>>> provenance events, but no further errors occur. That was exactly the time
>>> the whole flow was cut off.
>>> 
>>> I have to add that we never experienced this flow cut-off behaviour
>>> with version 1.12.1, although the ZooKeeper restarts occur there as well,
>>> and as often as seen above.
>>> But when we first updated to version 1.13.2 back in June, we experienced
>>> this flow cut-off multiple times, in addition to the broken load balancing.
>>> Back then I thought the two were connected, but now we had
>>> fully functional load balancing and the cut-off happened nevertheless.
>>> 
>>> I'm thankful for any hint. :)


Re: Re: Disfunctional cluster with version 1.13.2

Posted by Axel Schwarz <Ax...@emailn.de>.
Hey Pierre,

We're currently using the embedded ZooKeeper, so as far as I know this is version 3.5.6.
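
To put a number on "every few minutes", the grep output quoted below could be fed through a small script. This is only a sketch, assuming the raw (unwrapped) grep output is available with one log entry per line; the function names `shutdown_times` and `gaps_in_seconds` are made up for this illustration and are not part of NiFi or ZooKeeper.

```python
import re
from datetime import datetime

# Timestamp format used in nifi-app.log, e.g. "2021-08-11 12:02:39,187".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

# Case-sensitive on purpose: each restart also logs a lowercase
# "shutting down" line, which would otherwise be counted twice.
MARKER = "ZooKeeperServer Shutting down"


def shutdown_times(lines):
    """Return the timestamps of all ZooKeeper shutdown messages, in input order."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the time between consecutive shutdowns, in seconds."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # First three entries of the grep output, rejoined onto single lines.
    sample = [
        "./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
    ]
    for gap in gaps_in_seconds(shutdown_times(sample)):
        print(f"{gap:.1f} s since previous shutdown")
```

Running the same functions over the logs of all three nodes might show whether the restarts line up across the cluster or only affect one host.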

--- Original Message ---
From: Pierre Villard <pi...@gmail.com>
Date: 16.08.2021 16:59:22
To: users@nifi.apache.org, Axel Schwarz <Ax...@emailn.de>
Subject: Re: Disfunctional cluster with version 1.13.2

> Hi,
>
> What's the version of ZK?
>
> Thanks,
> Pierre
>
> On Thu, 12 Aug 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:
>
> > Dear all,
> >
> > after successfully battling the load balancing and installing version
> > 1.13.2 again in our 3-node production environment, we experienced another
> > failure in the cluster, resulting in a complete cut-off of the flow just
> > 1.5 h after the update.
> > We noticed it only when trying to access the web interface, which immediately
> > after login showed something like:
> >
> > "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> > node is not connected"
> >
> > There was nothing we could do through the web interface aside from staring
> > at this message, and when looking at the live logs, there was nothing
> > suspicious. The log moved on as if nothing had happened.
> > After a restart of the cluster everything was working fine again, but we
> > saw that the entire flow had not been working for some period of time. This
> > alone is really uncool, as we are running a cluster for exactly that reason:
> > The flow should keep working even if some node decides to malfunction for
> > whatever reason.
> >
> > Digging a little deeper into the logs showed two noticeable problems:
> >
> > 1. ZooKeeper is restarting every few minutes, which in the log always
> > looks like this:
> >
> > ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >
> > 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> > 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> > 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> > 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> > 2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> > 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> > 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> > 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> > java.lang.NullPointerException: null
> >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> > 2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > 2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
> > 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
> > 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
> > 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
> > 2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
> > 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
> > 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> > java.lang.NullPointerException: null
> >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>
> >
> >
> > ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >
> > This happens, as mentioned, every few minutes and looks like this when
> > condensed through grep:
> >
> > ------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------
> >
> > ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
> > [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> >
> >
> > ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> >
> > So my first question is: Is this behaviour of ZooKeeper normal? Well, it is
> > just an INFO, but I would love to hear that this is a completely normal
> > phenomenon. :D
> >
> > 2. After all these ZooKeeper restarts, NiFi finally throws the following
> > error along with the "last" ZooKeeper restart:
> >
> > ------------------------------------(nifiHost1.contoso.com)
> > nifi-app.log------------------------------------------------------------------------------------
>
> >
> > 2021-08-11 13:31:13,083 INFO [main-EventThread]
> > o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.zookeeper.server.ZooKeeperServer shutting down
> > 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>
> > 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.z.server.SyncRequestProcessor Shutting down
> > 2021-08-11 13:31:13,440 INFO [SyncThread:1]
> > o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> > 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>
> > 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
>
> > o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
>
> > minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> > /opt/nifi/logs/zookeeper/version-2 snapdir
> > /opt/nifi/state/zookeeper/version-2
> > 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
> > o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> > nifiHost2.contoso.com/192.168.1.11:19500
> > java.io.EOFException: Expected to receive a Transaction Completion
> > Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >         at java.lang.Thread.run(Thread.java:748)
> > 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
> > o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> > nifiHost2.contoso.com/192.168.1.11:19502
> > java.io.EOFException: Expected to receive a Transaction Completion
> > Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >         at
> > org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >         at java.lang.Thread.run(Thread.java:748)
> > 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
> > o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443
> > changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
> > state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=
> > nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
> > 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
> > o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> > 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
> > length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> > 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> > o.a.nifi.controller.StandardFlowService Processing reconnection request
> > from cluster coordinator.
> > 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> > o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary
> > Node disabled
> > 2021-08-11 13:31:13,802 INFO [main-EventThread]
> > o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Shutting down
> > 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer shutting down
> > 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
> > o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> > Repository
> > 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> > 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.z.server.SyncRequestProcessor Shutting down
> > 2021-08-11 13:31:28,203 INFO [SyncThread:1]
> > o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
> > o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
> > Writer for Provenance Event Store Partition[directory=/opt/pr] due to
> > MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> > 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> > o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
> > 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
> > milliseconds), max Transaction ID 27772177643
> > 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> > o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> > Repository with 10 records in 8 milliseconds
> > 2021-08-11 13:31:28,295 INFO [main-EventThread]
> > o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> > 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> > 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> > 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> > o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> > minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> > /opt/nifi/logs/zookeeper/version-2 snapdir
> > /opt/nifi/state/zookeeper/version-2
> > 2021-08-11 13:31:32,291 INFO [main-EventThread]
> > o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >
> >
> > ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > After that, the log continues with normal messages about heartbeats and
> > provenance events, but no further errors occur. That was exactly the time
> > at which the whole flow was cut off.
> >
> > I have to add that we never experienced this flow cut-off with version
> > 1.12.1, although the ZooKeeper restarts occur there as well, and as often
> > as seen above.
> > But when we first updated to version 1.13.2 back in June, we experienced
> > this flow cut-off multiple times, in addition to the non-working load
> > balancing. Back then I thought the two were connected, but now we had a
> > fully functional load balancing and the cut-off happened nevertheless.
> >
> > I'm thankful for any hint. :)
> >



Re: Disfunctional cluster with version 1.13.2

Posted by Pierre Villard <pi...@gmail.com>.
Hi,

What's the version of ZK?

Thanks,
Pierre

On Thu, 12 Aug 2021 at 09:55, Axel Schwarz <Ax...@emailn.de> wrote:

> Dear all,
>
> after successfully battling the load balancing and installing version
> 1.13.2 again in our 3-node production environment, we experienced another
> failure in the cluster, resulting in a complete cut-off of the flow just
> 1.5 h after the update.
> We noticed it only when trying to access the web interface, which
> immediately after login showed something like:
>
> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> node is not connected"
>
> There was nothing we could do through the web interface aside from staring
> at this message, and when looking at the live logs, there was nothing
> suspicious. The log moved on as if nothing had happened.
> After a restart of the cluster everything was working fine again, but we
> saw that the entire flow had not been working for some period of time. This
> alone is really uncool, as we are running a cluster for exactly that reason:
> The flow should keep working, even if some node decides to malfunction for
> whatever reason.
>
> Digging a little deeper into the logs showed two noticeable problems:
>
> 1. ZooKeeper is restarting every few minutes, which in the log always
> looks like this:
>
> ------------------------------------(nifiHost1.contoso.com)
> nifi-app.log------------------------------------------------------------------------------------
>
> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer shutting down
> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.SyncRequestProcessor Shutting down
> 2021-08-11 12:02:39,196 INFO [SyncThread:1]
> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> /opt/nifi/logs/zookeeper/version-2 snapdir
> /opt/nifi/state/zookeeper/version-2
> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread]
> org.apache.zookeeper.ClientCnxn Error while calling watcher
> java.lang.NullPointerException: null
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2021-08-11 12:02:39,277 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> 2021-08-11 12:02:40,001 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> Connection State changed to SUSPENDED
> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> Connection State changed to SUSPENDED
> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350
> Connection State changed to RECONNECTED
> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e
> Connection State changed to RECONNECTED
> 2021-08-11 12:02:40,014 INFO [main-EventThread]
> o.a.c.framework.imps.EnsembleTracker New config event received: {}
> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2]
> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11
> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11
> 12:02:40,127; send took 951 millis
> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread]
> org.apache.zookeeper.ClientCnxn Error while calling watcher
> java.lang.NullPointerException: null
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> This happens, as mentioned, every few minutes and looks like this when
> condensed through grep:
>
> ------------------------------------(nifiHost1.contoso.com) nifi-app.log
> | grep "ZooKeeperServer Shutting
> down"---------------------------------------------
>
> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> So my first question is: Is this behaviour of ZooKeeper normal? Well, it is
> just an INFO, but I would love to hear that this is a completely normal
> phenomenon. :D
>
> 2. After all these ZooKeeper restarts, NiFi finally throws the following
> error along with the "last" ZooKeeper restart:
>
> ------------------------------------(nifiHost1.contoso.com)
> nifi-app.log------------------------------------------------------------------------------------
>
> 2021-08-11 13:31:13,083 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer shutting down
> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.SyncRequestProcessor Shutting down
> 2021-08-11 13:31:13,440 INFO [SyncThread:1]
> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> /opt/nifi/logs/zookeeper/version-2 snapdir
> /opt/nifi/state/zookeeper/version-2
> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5]
> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> nifiHost2.contoso.com/192.168.1.11:19500
> java.io.EOFException: Expected to receive a Transaction Completion
> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>         at
> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6]
> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
> nifiHost2.contoso.com/192.168.1.11:19502
> java.io.EOFException: Expected to receive a Transaction Completion
> Indicator from Peer nifiHost2.contoso.com but encountered EOF
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
>         at
> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
>         at
> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1]
> o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443
> changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443,
> state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=
> nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7]
> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST,
> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> o.a.nifi.controller.StandardFlowService Processing reconnection request
> from cluster coordinator.
> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster]
> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary
> Node disabled
> 2021-08-11 13:31:13,802 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Shutting down
> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer shutting down
> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.z.server.SyncRequestProcessor Shutting down
> 2021-08-11 13:31:28,203 INFO [SyncThread:1]
> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21]
> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event
> Writer for Provenance Event Store Partition[directory=/opt/pr] due to
> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with
> 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1
> milliseconds), max Transaction ID 27772177643
> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 10 records in 8 milliseconds
> 2021-08-11 13:31:28,295 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)]
> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir
> /opt/nifi/logs/zookeeper/version-2 snapdir
> /opt/nifi/state/zookeeper/version-2
> 2021-08-11 13:31:32,291 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> After that, the log continues with normal messages about heartbeats and
> provenance events, but no further errors occur. That was exactly the time
> at which the whole flow was cut off.
>
> I have to add that we never experienced this flow cut-off with version
> 1.12.1, although the ZooKeeper restarts occur there as well, and as often
> as seen above.
> But when we first updated to version 1.13.2 back in June, we experienced
> this flow cut-off multiple times, in addition to the non-working load
> balancing. Back then I thought the two were connected, but now we had a
> fully functional load balancing and the cut-off happened nevertheless.
>
> I'm thankful for any hint. :)
>
>
>
>
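
The grep listing quoted above shows the restarts only as raw timestamps. As a rough way to put a number on "every few minutes", a small Python sketch along the following lines could compute the gaps between consecutive "ZooKeeperServer Shutting down" entries. It is only an illustration and not something from the thread: the function names are made up for this example, and it assumes that each log entry sits on a single line (as in the real nifi-app.log, not as wrapped in this mail) and uses the timestamp layout seen in the excerpts.

```python
import re
from datetime import datetime

# Timestamp layout seen in the quoted excerpts, e.g. "2021-08-11 12:02:39,187".
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

# Same case-sensitive string as in the quoted grep, so the second
# "shutting down" line of each restart is not counted twice.
MARKER = "ZooKeeperServer Shutting down"


def shutdown_times(lines):
    """Return the timestamp of every log line that contains MARKER."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the time between consecutive shutdowns, in seconds."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Three entries taken from the quoted grep output, joined onto one line each.
    sample = [
        "./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
        "./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down",
    ]
    print(gaps_in_seconds(shutdown_times(sample)))
```

To run it over a whole log, the lines of the file can be passed in directly, for example `shutdown_times(open("nifi-app.log"))`. The resulting gaps only describe how often the message appears; they say nothing about why ZooKeeper restarts.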