Posted to users@nifi.apache.org by Wyll Ingersoll <wy...@keepertech.com> on 2020/09/29 14:57:50 UTC

Clustered nifi issues

I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a StatefulSet) using external zookeeper (3 nodes also) to manage state.

Whenever even 1 node (pod/container) goes down or is restarted, it can throw the whole cluster into a bad state that forces me to restart ALL of the pods in order to recover.  This seems wrong.  The problem seems to be that when the primary node goes away, the remaining 2 nodes don't ever try to take over.  Instead, I have to restart all of them individually until one of them becomes the primary; then the other 2 eventually join and sync up.

When one of the nodes is refusing to sync up, I often see these errors in the log and the only way to get it back into the cluster is to restart it.  The node showing the errors below never seems to be able to rejoin or resync with the other 2 nodes.



2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered


Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
Thanks, I'll try it out.  I had the liveness probe already, but wasn't quite sure what to check for in the readinessProbe.
________________________________
From: Chris Sampson <ch...@naimuri.com>
Sent: Thursday, October 1, 2020 9:03 AM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

For info, the probes we currently use for our StatefulSet Pods are:

  *   livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
  *   readinessProbe - exec command to curl the nifi-api/controller/cluster endpoint to check the node's cluster connection status, e.g.:

readinessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        if [ "${SECURE}" = "true" ]; then
          INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

          curl -v \
            --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
            --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
            --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
            https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        else
          curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        fi

        STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

        if [[ ! $STATUS = "CONNECTED" ]]; then
          echo "Node not found with CONNECTED state. Full cluster state:"
          jq . /tmp/cluster.state
          exit 1
        fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to call the endpoint and for whom our pod contains a set of certificate files in the indicated locations (generated from NiFi Toolkit in an init-container before the Pod starts); jq utility was added into our customised version of the apache/nifi Docker Image.
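For completeness, a minimal livenessProbe along the lines of the tcpSocket check described above might look like the sketch below (the port matches the nifi.web.https.port used in this thread; the timing values are assumptions to tune for your deployment):

livenessProbe:
  tcpSocket:
    port: 8080              # NiFi web/API port
  initialDelaySeconds: 60   # give NiFi time to start before the first probe
  periodSeconds: 30
  failureThreshold: 3       # restart the container after 3 consecutive failures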


---
Chris Sampson
IT Consultant
chris.sampson@naimuri.com


On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <wy...@keepertech.com>> wrote:
Thanks for following up and filing the issue. Unfortunately, I don't have any of the logs from the original issue since I have since restarted and rebooted my containers many times.
________________________________
From: Mark Payne <ma...@hotmail.com>>
Sent: Wednesday, September 30, 2020 11:21 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Thanks Wyll,

I created a Jira [1] to address this. The NullPointer that you show in the stack trace will prevent the node from reconnecting to the cluster. Unfortunately, it’s a bug that needs to be addressed. It’s possible that you may find a way to work around the issue, but I can’t tell you off the top of my head what that would be.

Can you check the logs for anything else from the StandardFlowService class? That may help to understand why the null value is getting returned, causing the NullPointerException that you’re seeing.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7866

On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:

1.11.4
________________________________
From: Mark Payne <ma...@hotmail.com>>
Sent: Wednesday, September 30, 2020 11:02 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Wyll,

What version of nifi are you running?

Thanks
-Mark


On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:


  *   Yes - the host-specific parameters on the different instances are configured correctly (nifi-0, nifi-1, nifi-2)
  *   Yes - we have a separate certificate for each node and the keystores are configured correctly.
  *   Yes - we have a headless service in front of the STS cluster
  *   No - I don't think there is an explicit liveness or readiness probe defined for the STS, perhaps I need to add one. Do you have an example?

-Wyllys


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 3:21 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

We started to have more stability when we switched to bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.

Your properties have nifi-0 in several places, so just to double-check: are the relevant properties changed for each of the instances within your statefulset?

For example:
* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host


Yes

And are you using a separate (non-wildcard) certificate for each node?


Do you have liveness/readiness probes set on your nifi sts?


And are you using a headless service[1] to manage the cluster during startup?


[1] https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
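For reference, a headless Service is just a regular Service with clusterIP set to None, which gives the StatefulSet pods the stable per-pod DNS names NiFi needs for clustering. A minimal sketch, assuming the service name, namespace and ports that appear elsewhere in this thread:

apiVersion: v1
kind: Service
metadata:
  name: nifi            # yields pod DNS names like nifi-0.nifi.ki.svc.cluster.local
  namespace: ki
spec:
  clusterIP: None       # headless: DNS resolves directly to the pod IPs
  selector:
    app: nifi           # must match the StatefulSet pod labels
  ports:
    - name: https
      port: 8080
    - name: cluster-protocol
      port: 2882
    - name: load-balance
      port: 6342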


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <wy...@keepertech.com>> wrote:
Zookeeper is from the docker hub zookeeper:3.5.7 image.

Below is our nifi.properties (with secrets and hostnames modified).

thanks!
 - Wyllys


nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
nifi.flow.configuration.archive.enabled=true
nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
nifi.flow.configuration.archive.max.time=30 days
nifi.flow.configuration.archive.max.storage=500 MB
nifi.flow.configuration.archive.max.count=
nifi.flowcontroller.autoResumeState=false
nifi.flowcontroller.graceful.shutdown.period=10 sec
nifi.flowservice.writedelay.interval=500 ms
nifi.administrative.yield.duration=30 sec

nifi.bored.yield.duration=10 millis
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB

nifi.authorizer.configuration.file=./conf/authorizers.xml
nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.templates.directory=/opt/nifi/nifi-current/templates
nifi.ui.banner.text=KI Nifi Cluster
nifi.ui.autorefresh.interval=30 sec
nifi.nar.library.directory=./lib
nifi.nar.library.autoload.directory=./extensions
nifi.nar.working.directory=./work/nar/
nifi.documentation.working.directory=./work/docs/components

nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties

nifi.database.directory=./database_repository
nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=./flowfile_repository
nifi.flowfile.repository.partitions=256
nifi.flowfile.repository.checkpoint.interval=2 mins
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.encryption.key.provider.implementation=
nifi.flowfile.repository.encryption.key.provider.location=
nifi.flowfile.repository.encryption.key.id=
nifi.flowfile.repository.encryption.key=

nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=

nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=

nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=7 days
nifi.provenance.repository.max.storage.size=100 GB
nifi.provenance.repository.rollover.time=120 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
nifi.provenance.repository.indexed.attributes=
nifi.provenance.repository.index.shard.size=4 GB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.buffer.size=100000

nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
nifi.components.status.repository.buffer.size=1440
nifi.components.status.snapshot.frequency=1 min

nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs

nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=
nifi.web.http.network.interface.default=
nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
nifi.web.https.port=8080
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=/nifi-api,/nifi
nifi.web.proxy.host=ingress.ourdomain.com

nifi.sensitive.props.key=
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC
nifi.sensitive.props.additional.keys=

nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
nifi.security.user.authorizer=managed-authorizer
nifi.security.user.login.identity.provider=
nifi.security.ocsp.responder.url=
nifi.security.ocsp.responder.certificate=

nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
nifi.security.user.oidc.connect.timeout=15 secs
nifi.security.user.oidc.read.timeout=15 secs
nifi.security.user.oidc.client.id=nifi
nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
nifi.security.user.oidc.preferred.jwsalgorithm=RS512
nifi.security.user.oidc.additional.scopes=
nifi.security.user.oidc.claim.identifying.user=

nifi.security.user.knox.url=
nifi.security.user.knox.publicKey=
nifi.security.user.knox.cookieName=hadoop-jwt
nifi.security.user.knox.audiences=

nifi.cluster.protocol.heartbeat.interval=30 secs
nifi.cluster.protocol.is.secure=true

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.node.protocol.port=2882
nifi.cluster.node.protocol.threads=40
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=120 secs
nifi.cluster.node.read.timeout=120 secs
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=

nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=

nifi.kerberos.krb5.file=

nifi.kerberos.service.principal=
nifi.kerberos.service.keytab.location=

nifi.kerberos.spnego.principal=
nifi.kerberos.spnego.keytab.location=
nifi.kerberos.spnego.authentication.expiration=12 hours

nifi.variable.registry.properties=

nifi.analytics.predict.enabled=false
nifi.analytics.predict.interval=3 mins
nifi.analytics.query.interval=5 mins
nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
nifi.analytics.connection.model.score.name=rSquared
nifi.analytics.connection.model.score.threshold=.90

________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 12:41 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Also, which version of zookeeper and what image (I've found different versions and images provided better stability)?


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <sk...@gmail.com>> wrote:
Hello Wyll

It may be helpful if you can send nifi.properties.

Thanks
Sushil Kumar

On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a StatefulSet) using external zookeeper (3 nodes also) to manage state.

Whenever even 1 node (pod/container) goes down or is restarted, it can throw the whole cluster into a bad state that forces me to restart ALL of the pods in order to recover.  This seems wrong.  The problem seems to be that when the primary node goes away, the remaining 2 nodes don't ever try to take over.  Instead, I have to restart all of them individually until one of them becomes the primary; then the other 2 eventually join and sync up.

When one of the nodes is refusing to sync up, I often see these errors in the log and the only way to get it back into the cluster is to restart it.  The node showing the errors below never seems to be able to rejoin or resync with the other 2 nodes.


2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered


Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
I see what's happening.  The container sets up a /root/.nifi-cli.config file that has the required security parameters so that the user doesn't have to supply them on the command line.
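For reference, a CLI properties file of that kind just carries the connection and TLS settings the CLI would otherwise need as command-line arguments. A minimal sketch using the property names from the toolkit guide (the paths and values below are placeholders, not our actual config):

baseUrl=https://nifi-0.nifi.ki.svc.cluster.local:8080
keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
keystoreType=JKS
keystorePasswd=XXXXXXXX
keyPasswd=XXXXXXXX
truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
truststoreType=JKS
truststorePasswd=XXXXXXXX

A file like this can also be passed explicitly with the -p/--properties option, e.g. cli.sh nifi get-nodes -ot json -p /root/.nifi-cli.config.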
________________________________
From: Bryan Bende <bb...@gmail.com>
Sent: Wednesday, October 14, 2020 10:45 AM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

The CLI does not use nifi.properties; there are several ways of passing in config...

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#property-argument-handling

On Wed, Oct 14, 2020 at 10:01 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:
That makes sense.  It must be reading the keystore/truststore specified in the nifi.properties file then?
________________________________
From: Bryan Bende <bb...@gmail.com>>
Sent: Wednesday, October 14, 2020 9:59 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

The get-nodes command calls the REST resource /controller/cluster which authorizes against READ on /controller [1], so there is no way you can call this in a secure environment without authenticating somehow, which from the CLI means specifying a keystore/truststore.

[1] https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java#L857

On Wed, Oct 14, 2020 at 9:26 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:
Yes, this is for a secured cluster deployed as a Kubernetes stateful set.  The certificate parameters are apparently not needed to just get the status of the nodes using the command below.



________________________________
From: Sushil Kumar <sk...@gmail.com>>
Sent: Tuesday, October 13, 2020 4:01 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Did you say that the same line of code works fine for secured clusters too?
I asked because nifi-toolkit has a separate set of parameters asking for certificates and everything else related to secure clusters.


On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I found that instead of dealing with nifi client certificate hell, the nifi-toolkit cli.sh will work just fine for testing the readiness of the cluster.  Here is my readiness script, which seems to work just fine in kubernetes with the apache/nifi docker container version 1.12.1



#!/bin/bash

# Ask the NiFi Toolkit CLI for the cluster's node list (JSON output)
$NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state

# If the CLI call itself failed, dump its output and report not-ready
if [ $? -ne 0 ]; then
        cat /tmp/cluster.state
        exit 1
fi

# Pull out this node's status by matching its FQDN (or localhost)
STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

# Only report ready when the node is CONNECTED to the cluster
if [[ ! $STATUS = "CONNECTED" ]]; then
        echo "Node not found with CONNECTED state. Full cluster state:"
        jq . /tmp/cluster.state
        exit 1
fi
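In case it helps anyone else, here is a sketch of how a script like this might be wired into the Pod spec (the script path and timing values are assumptions for your own image/deployment):

readinessProbe:
  exec:
    command:
      - bash
      - /opt/nifi/scripts/readiness.sh   # wherever the script above is baked into the image
  initialDelaySeconds: 60
  periodSeconds: 30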


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Thursday, October 1, 2020 9:03 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

For info, the probes we currently use for our StatefulSet Pods are:

  *   livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
  *   readinessProbe - exec command to curl the nifi-api/controller/cluster endpoint to check the node's cluster connection status, e.g.:

readinessProbe:
exec:
command:
- bash
- -c
- |
if [ "${SECURE}" = "true" ]; then
INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

curl -v \
--cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
--cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
--key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
else
curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then
echo "Node not found with CONNECTED state. Full cluster state:"
jq . /tmp/cluster.state
exit 1
fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to call the endpoint and for whom our pod contains a set of certificate files in the indicated locations (generated from NiFi Toolkit in an init-container before the Pod starts); jq utility was added into our customised version of the apache/nifi Docker Image.


---
Chris Sampson
IT Consultant
chris.sampson@naimuri.com


Re: Clustered nifi issues

Posted by Bryan Bende <bb...@gmail.com>.
The CLI does not use nifi.properties, there are several ways of passing in
config...

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#property-argument-handling

On Wed, Oct 14, 2020 at 10:01 AM Wyll Ingersoll <
wyllys.ingersoll@keepertech.com> wrote:

> That makes sense.  It must be reading the keystore/truststore specified in
> the nifi.properties file then?
> ------------------------------
> *From:* Bryan Bende <bb...@gmail.com>
> *Sent:* Wednesday, October 14, 2020 9:59 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> The get-nodes command calls the REST resource /controller/cluster which
> authorizes against READ on /controller [1], so there is no way you can call
> this in a secure environment without authenticating somehow, which from the
> CLI means specifying a keystore/truststore.
>
> [1]
> https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java#L857
>
> On Wed, Oct 14, 2020 at 9:26 AM Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
> Yes, this is for a secured cluster deployed as a Kubernetes stateful set.
> The certificate parameters are apparently not needed to just get the status
> of the nodes using the command below.
>
>
>
> ------------------------------
> *From:* Sushil Kumar <sk...@gmail.com>
> *Sent:* Tuesday, October 13, 2020 4:01 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Did you say that the same line of code works fine for secured clusters
> too.
> I asked because nifi-toolkit has a separate set of parameters asking for
> certificates and everything else related to secure clusters.
>
>
> On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
> I found that instead of dealing with nifi client certificate hell, the
> nifi-toolkit cli.sh will work just fine for testing the readiness of the
> cluster.  Here is my readiness script which seems to work just fine with in
> kubernetes with the apache/nifi docker container version 1.12.1
>
>
> #!/bin/bash
>
>
> $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
>
> if [ $? -ne 0 ]; then
>
>         cat /tmp/cluster.state
>
>         exit 1
>
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
>
>         echo "Node not found with CONNECTED state. Full cluster state:"
>
>         jq . /tmp/cluster.state
>
>         exit 1
>
> fi
>
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Thursday, October 1, 2020 9:03 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> For info, the probes we currently use for our StatefulSet Pods are:
>
>    - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
>    - readinessProbe - exec command to curl the
>    nifi-api/controller/cluster endpoint to check the node's cluster connection
>    status, e.g.:
>
> readinessProbe:
> exec:
> command:
> - bash
> - -c
> - |
> if [ "${SECURE}" = "true" ]; then
> INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]'
> | tr ' ' '-')
>
> curl -v \
> --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
> --cert
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem
> \
> --key
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem
> \
> https://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> else
> curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
> echo "Node not found with CONNECTED state. Full cluster state:"
> jq . /tmp/cluster.state
> exit 1
> fi
>
>
> Note that INITIAL_ADMIN is the CN of a user with appropriate permissions
> to call the endpoint and for whom our pod contains a set of certificate
> files in the indicated locations (generated from NiFi Toolkit in an
> init-container before the Pod starts); jq utility was added into our
> customised version of the apache/nifi Docker Image.
>
>
> ---
> *Chris Sampson*
> IT Consultant
> chris.sampson@naimuri.com
> <https://www.naimuri.com/>
>
>
>

Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
That makes sense.  It must be reading the keystore/truststore specified in the nifi.properties file then?
________________________________
From: Bryan Bende <bb...@gmail.com>
Sent: Wednesday, October 14, 2020 9:59 AM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

The get-nodes command calls the REST resource /controller/cluster which authorizes against READ on /controller [1], so there is no way you can call this in a secure environment without authenticating somehow, which from the CLI means specifying a keystore/truststore.

[1] https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java#L857

On Wed, Oct 14, 2020 at 9:26 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:
Yes, this is for a secured cluster deployed as a Kubernetes stateful set.  The certificate parameters are apparently not needed to just get the status of the nodes using the command below.



________________________________
From: Sushil Kumar <sk...@gmail.com>>
Sent: Tuesday, October 13, 2020 4:01 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Did you say that the same line of code works fine for secured clusters too.
I asked because nifi-toolkit has a separate set of parameters asking for certificates and everything else related to secure clusters.


On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I found that instead of dealing with nifi client certificate hell, the nifi-toolkit cli.sh will work just fine for testing the readiness of the cluster.  Here is my readiness script, which seems to work just fine in kubernetes with the apache/nifi docker container version 1.12.1



#!/bin/bash


$NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state

if [ $? -ne 0 ]; then

        cat /tmp/cluster.state

        exit 1

fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then

        echo "Node not found with CONNECTED state. Full cluster state:"

        jq . /tmp/cluster.state

        exit 1

fi


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Thursday, October 1, 2020 9:03 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

For info, the probes we currently use for our StatefulSet Pods are:

  *   livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
  *   readinessProbe - exec command to curl the nifi-api/controller/cluster endpoint to check the node's cluster connection status, e.g.:

readinessProbe:
exec:
command:
- bash
- -c
- |
if [ "${SECURE}" = "true" ]; then
INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

curl -v \
--cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
--cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
--key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
else
curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then
echo "Node not found with CONNECTED state. Full cluster state:"
jq . /tmp/cluster.state
exit 1
fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to call the endpoint and for whom our pod contains a set of certificate files in the indicated locations (generated from NiFi Toolkit in an init-container before the Pod starts); jq utility was added into our customised version of the apache/nifi Docker Image.


---
Chris Sampson
IT Consultant
chris.sampson@naimuri.com


Re: Clustered nifi issues

Posted by Bryan Bende <bb...@gmail.com>.
The get-nodes command calls the REST resource /controller/cluster which
authorizes against READ on /controller [1], so there is no way you can call
this in a secure environment without authenticating somehow, which from the
CLI means specifying a keystore/truststore.

[1]
https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java#L857

On Wed, Oct 14, 2020 at 9:26 AM Wyll Ingersoll <
wyllys.ingersoll@keepertech.com> wrote:

> Yes, this is for a secured cluster deployed as a Kubernetes stateful set.
> The certificate parameters are apparently not needed to just get the status
> of the nodes using the command below.
>
>
>
> ------------------------------
> *From:* Sushil Kumar <sk...@gmail.com>
> *Sent:* Tuesday, October 13, 2020 4:01 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Did you say that the same line of code works fine for secured clusters
> too.
> I asked because nifi-toolkit has a separate set of parameters asking for
> certificates and everything else related to secure clusters.
>
>
> On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
> I found that instead of dealing with nifi client certificate hell, the
> nifi-toolkit cli.sh will work just fine for testing the readiness of the
> cluster.  Here is my readiness script which seems to work just fine with in
> kubernetes with the apache/nifi docker container version 1.12.1
>
>
> #!/bin/bash
>
>
> $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
>
> if [ $? -ne 0 ]; then
>
>         cat /tmp/cluster.state
>
>         exit 1
>
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
>
>         echo "Node not found with CONNECTED state. Full cluster state:"
>
>         jq . /tmp/cluster.state
>
>         exit 1
>
> fi
>
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Thursday, October 1, 2020 9:03 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> For info, the probes we currently use for our StatefulSet Pods are:
>
>    - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
>    - readinessProbe - exec command to curl the
>    nifi-api/controller/cluster endpoint to check the node's cluster connection
>    status, e.g.:
>
> readinessProbe:
> exec:
> command:
> - bash
> - -c
> - |
> if [ "${SECURE}" = "true" ]; then
> INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]'
> | tr ' ' '-')
>
> curl -v \
> --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
> --cert
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem
> \
> --key
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem
> \
> https://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> else
> curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
> echo "Node not found with CONNECTED state. Full cluster state:"
> jq . /tmp/cluster.state
> exit 1
> fi
>
>
> Note that INITIAL_ADMIN is the CN of a user with appropriate permissions
> to call the endpoint and for whom our pod contains a set of certificate
> files in the indicated locations (generated from NiFi Toolkit in an
> init-container before the Pod starts); jq utility was added into our
> customised version of the apache/nifi Docker Image.
>
>
> ---
> *Chris Sampson*
> IT Consultant
> chris.sampson@naimuri.com
> <https://www.naimuri.com/>
>
>
>

Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
Yes, this is for a secured cluster deployed as a Kubernetes stateful set.  The certificate parameters are apparently not needed to just get the status of the nodes using the command below.



________________________________
From: Sushil Kumar <sk...@gmail.com>
Sent: Tuesday, October 13, 2020 4:01 PM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

Did you say that the same line of code works fine for secured clusters too.
I asked because nifi-toolkit has a separate set of parameters asking for certificates and everything else related to secure clusters.


On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I found that instead of dealing with nifi client certificate hell, the nifi-toolkit cli.sh will work just fine for testing the readiness of the cluster.  Here is my readiness script, which seems to work just fine in kubernetes with the apache/nifi docker container version 1.12.1



#!/bin/bash


$NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state

if [ $? -ne 0 ]; then

        cat /tmp/cluster.state

        exit 1

fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then

        echo "Node not found with CONNECTED state. Full cluster state:"

        jq . /tmp/cluster.state

        exit 1

fi


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Thursday, October 1, 2020 9:03 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

For info, the probes we currently use for our StatefulSet Pods are:

  *   livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
  *   readinessProbe - exec command to curl the nifi-api/controller/cluster endpoint to check the node's cluster connection status, e.g.:

readinessProbe:
exec:
command:
- bash
- -c
- |
if [ "${SECURE}" = "true" ]; then
INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

curl -v \
--cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
--cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
--key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
else
curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then
echo "Node not found with CONNECTED state. Full cluster state:"
jq . /tmp/cluster.state
exit 1
fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to call the endpoint and for whom our pod contains a set of certificate files in the indicated locations (generated from NiFi Toolkit in an init-container before the Pod starts); jq utility was added into our customised version of the apache/nifi Docker Image.


---
Chris Sampson
IT Consultant
chris.sampson@naimuri.com


Re: Clustered nifi issues

Posted by Sushil Kumar <sk...@gmail.com>.
Did you say that the same line of code works fine for secured clusters too.
I asked because nifi-toolkit has a separate set of parameters asking for
certificates and everything else related to secure clusters.


On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <
wyllys.ingersoll@keepertech.com> wrote:

>
> I found that instead of dealing with nifi client certificate hell, the
> nifi-toolkit cli.sh will work just fine for testing the readiness of the
> cluster.  Here is my readiness script which seems to work just fine with in
> kubernetes with the apache/nifi docker container version 1.12.1
>
>
> #!/bin/bash
>
>
> $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
>
> if [ $? -ne 0 ]; then
>
>         cat /tmp/cluster.state
>
>         exit 1
>
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
>
>         echo "Node not found with CONNECTED state. Full cluster state:"
>
>         jq . /tmp/cluster.state
>
>         exit 1
>
> fi
>
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Thursday, October 1, 2020 9:03 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> For info, the probes we currently use for our StatefulSet Pods are:
>
>    - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
>    - readinessProbe - exec command to curl the
>    nifi-api/controller/cluster endpoint to check the node's cluster connection
>    status, e.g.:
>
> readinessProbe:
> exec:
> command:
> - bash
> - -c
> - |
> if [ "${SECURE}" = "true" ]; then
> INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]'
> | tr ' ' '-')
>
> curl -v \
> --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
> --cert
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem
> \
> --key
> ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem
> \
> https://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> else
> curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster >
> /tmp/cluster.state
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\")
> or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
> echo "Node not found with CONNECTED state. Full cluster state:"
> jq . /tmp/cluster.state
> exit 1
> fi
>
>
> Note that INITIAL_ADMIN is the CN of a user with appropriate permissions
> to call the endpoint and for whom our pod contains a set of certificate
> files in the indicated locations (generated from NiFi Toolkit in an
> init-container before the Pod starts); jq utility was added into our
> customised version of the apache/nifi Docker Image.
>
>
> ---
> *Chris Sampson*
> IT Consultant
> chris.sampson@naimuri.com
> <https://www.naimuri.com/>
>
>
> On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
> Thanks for following up and filing the issue. Unfortunately, I dont have
> any of the logs from the original issue since I have since restarted and
> rebooted my containers many times.
> ------------------------------
> *From:* Mark Payne <ma...@hotmail.com>
> *Sent:* Wednesday, September 30, 2020 11:21 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Thanks Wyll,
>
> I created a Jira [1] to address this. The NullPointer that you show in the
> stack trace will prevent the node from reconnecting to the cluster.
> Unfortunately, it’s a bug that needs to be addressed. It’s possible that
> you may find a way to work around the issue, but I can’t tell you off the
> top of my head what that would be.
>
> Can you check the logs for anything else from the StandardFlowService
> class? That may help to understand why the null value is getting returned,
> causing the NullPointerException that you’re seeing.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-7866
>
> On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
> 1.11.4
> ------------------------------
> *From:* Mark Payne <ma...@hotmail.com>
> *Sent:* Wednesday, September 30, 2020 11:02 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Wyll,
>
> What version of nifi are you running?
>
> Thanks
> -Mark
>
>
> On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
>    - Yes - the host specific parameters on the different instances are
>    configured correctly (nifi-0, nifi-1, nifi-2)
>    - Yes - we have separate certificate for each node and the keystores
>    are configured correctly.
>    - Yes - we have a headless service in front of the STS cluster
>    - No - I don't think there is an explicit liveness or readiness probe
>    defined for the STS, perhaps I need to add one. Do you have an example?
>
>
> -Wyllys
>
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Tuesday, September 29, 2020 3:21 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> We started to have more stability when we switched to
> bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.
>
> Your properties have nifi-0 in several places, so just to double check
> that the relevant properties are changed for each of the instances within
> your statefulset?
>
> For example:
> * nifi.remote.input.host
> * nifi.cluster.node.address
> * nifi.web.https.host
>
>
> Yes
>
> And are you using a separate (non-wildcard) certificate for each node?
>
>
> Do you have liveness/readiness probes set on your nifi sts?
>
>
> And are you using a headless service[1] to manage the cluster during
> startup?
>
>
> [1]
> https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
>
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <
> wyllys.ingersoll@keepertech.com> wrote:
>
> Zookeeper is from the docker hub zookeeper:3.5.7 image.
>
> Below is our nifi.properties (with secrets and hostnames modified).
>
> thanks!
>  - Wyllys
>
>
>
> nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
> nifi.flow.configuration.archive.enabled=true
> nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
> nifi.flow.configuration.archive.max.time=30 days
> nifi.flow.configuration.archive.max.storage=500 MB
> nifi.flow.configuration.archive.max.count=
> nifi.flowcontroller.autoResumeState=false
> nifi.flowcontroller.graceful.shutdown.period=10 sec
> nifi.flowservice.writedelay.interval=500 ms
> nifi.administrative.yield.duration=30 sec
>
> nifi.bored.yield.duration=10 millis
> nifi.queue.backpressure.count=10000
> nifi.queue.backpressure.size=1 GB
>
> nifi.authorizer.configuration.file=./conf/authorizers.xml
>
> nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
> nifi.templates.directory=/opt/nifi/nifi-current/templates
> nifi.ui.banner.text=KI Nifi Cluster
> nifi.ui.autorefresh.interval=30 sec
> nifi.nar.library.directory=./lib
> nifi.nar.library.autoload.directory=./extensions
> nifi.nar.working.directory=./work/nar/
> nifi.documentation.working.directory=./work/docs/components
>
> nifi.state.management.configuration.file=./conf/state-management.xml
> nifi.state.management.provider.local=local-provider
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=false
>
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
>
> nifi.database.directory=./database_repository
> nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
>
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
>
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=./flowfile_repository
> nifi.flowfile.repository.partitions=256
> nifi.flowfile.repository.checkpoint.interval=2 mins
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.encryption.key.provider.implementation=
> nifi.flowfile.repository.encryption.key.provider.location=
> nifi.flowfile.repository.encryption.key.id=
> nifi.flowfile.repository.encryption.key=
>
>
> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
> nifi.queue.swap.threshold=20000
> nifi.swap.in.period=5 sec
> nifi.swap.in.threads=1
> nifi.swap.out.period=5 sec
> nifi.swap.out.threads=4
>
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=1 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
> nifi.content.repository.encryption.key.provider.implementation=
> nifi.content.repository.encryption.key.provider.location=
> nifi.content.repository.encryption.key.id=
> nifi.content.repository.encryption.key=
>
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
> nifi.provenance.repository.debug.frequency=1_000_000
> nifi.provenance.repository.encryption.key.provider.implementation=
> nifi.provenance.repository.encryption.key.provider.location=
> nifi.provenance.repository.encryption.key.id=
> nifi.provenance.repository.encryption.key=
>
> nifi.provenance.repository.directory.default=./provenance_repository
> nifi.provenance.repository.max.storage.time=7 days
> nifi.provenance.repository.max.storage.size=100 GB
> nifi.provenance.repository.rollover.time=120 secs
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=2
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID,
> Filename, ProcessorID, Relationship
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=4 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=2
> nifi.provenance.repository.buffer.size=100000
>
>
> nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
> nifi.components.status.repository.buffer.size=1440
> nifi.components.status.snapshot.frequency=1 min
>
> nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.remote.input.secure=true
> nifi.remote.input.socket.port=10000
> nifi.remote.input.http.enabled=true
> nifi.remote.input.http.transaction.ttl=30 sec
> nifi.remote.contents.cache.expiration=30 secs
>
> nifi.web.war.directory=./lib
> nifi.web.http.host=
> nifi.web.http.port=
> nifi.web.http.network.interface.default=
> nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.web.https.port=8080
> nifi.web.https.network.interface.default=
> nifi.web.jetty.working.directory=./work/jetty
> nifi.web.jetty.threads=200
> nifi.web.max.header.size=16 KB
> nifi.web.proxy.context.path=/nifi-api,/nifi
> nifi.web.proxy.host=ingress.ourdomain.com
>
> nifi.sensitive.props.key=
> nifi.sensitive.props.key.protected=
> nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
> nifi.sensitive.props.provider=BC
> nifi.sensitive.props.additional.keys=
>
> nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
> nifi.security.keystoreType=jks
> nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
> nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
>
> nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
> nifi.security.truststoreType=jks
> nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.authorizer=managed-authorizer
> nifi.security.user.login.identity.provider=
> nifi.security.ocsp.responder.url=
> nifi.security.ocsp.responder.certificate=
>
> nifi.security.user.oidc.discovery.url=
> https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
> nifi.security.user.oidc.connect.timeout=15 secs
> nifi.security.user.oidc.read.timeout=15 secs
> nifi.security.user.oidc.client.id=nifi
> nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.oidc.preferred.jwsalgorithm=RS512
> nifi.security.user.oidc.additional.scopes=
> nifi.security.user.oidc.claim.identifying.user=
>
> nifi.security.user.knox.url=
> nifi.security.user.knox.publicKey=
> nifi.security.user.knox.cookieName=hadoop-jwt
> nifi.security.user.knox.audiences=
>
> nifi.cluster.protocol.heartbeat.interval=30 secs
> nifi.cluster.protocol.is.secure=true
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.node.protocol.port=2882
> nifi.cluster.node.protocol.threads=40
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=120 secs
> nifi.cluster.node.read.timeout=120 secs
> nifi.cluster.node.max.concurrent.requests=100
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=
>
> nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.load.balance.port=6342
> nifi.cluster.load.balance.connections.per.node=4
> nifi.cluster.load.balance.max.thread.count=8
> nifi.cluster.load.balance.comms.timeout=30 sec
>
>
> nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
> nifi.zookeeper.root.node=/nifi
> nifi.zookeeper.auth.type=
> nifi.zookeeper.kerberos.removeHostFromPrincipal=
> nifi.zookeeper.kerberos.removeRealmFromPrincipal=
>
> nifi.kerberos.krb5.file=
>
> nifi.kerberos.service.principal=
> nifi.kerberos.service.keytab.location=
>
> nifi.kerberos.spnego.principal=
> nifi.kerberos.spnego.keytab.location=
> nifi.kerberos.spnego.authentication.expiration=12 hours
>
> nifi.variable.registry.properties=
>
> nifi.analytics.predict.enabled=false
> nifi.analytics.predict.interval=3 mins
> nifi.analytics.query.interval=5 mins
>
> nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
> nifi.analytics.connection.model.score.name=rSquared
> nifi.analytics.connection.model.score.threshold=.90
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Tuesday, September 29, 2020 12:41 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Also, which version of zookeeper and what image (I've found different
> versions and images provided better stability)?
>
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <sk...@gmail.com> wrote:
>
> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
> I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a
> StatefulSet) using external zookeeper (3 nodes also) to manage state.
>
> Whenever even 1 node (pod/container) goes down or is restarted, it can
> throw the whole cluster into a bad state that forces me to restart ALL of
> the pods in order to recover.  This seems wrong.  The problem seems to be
> that when the primary node goes away, the remaining 2 nodes don't ever try
> to take over.  Instead, I have to restart all of them individually until one
> of them becomes the primary, then the other 2 eventually join and sync up.
>
> When one of the nodes is refusing to sync up, I often see these errors in
> the log and the only way to get it back into the cluster is to restart it.
> The node showing the errors below never seems to be able to rejoin or
> resync with the other 2 nodes.
>
>
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster]
> o.a.nifi.controller.StandardFlowService Handling reconnection request
> failed due to: org.apache.nifi.cluster.ConnectionException: Failed to
> connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to
> cluster due to: java.lang.NullPointerException
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
> at
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
> at
> org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
> at
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
> ... 4 common frames omitted
> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster]
> o.a.c.f.imps.CuratorFrameworkImpl Starting
> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster]
> o.a.c.f.imps.CuratorFrameworkImpl Default schema
> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: CONNECTED
> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread]
> o.a.c.framework.imps.EnsembleTracker New config event received:
> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181, version=0,
> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181,
> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181}
> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0]
> o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread]
> o.a.c.framework.imps.EnsembleTracker New config event received:
> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181, version=0,
> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181,
> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181}
> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster]
> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
> Role 'Primary Node' becuase that role is not registered
> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster]
> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
> Role 'Cluster Coordinator' becuase that role is not registered
>
>
>

Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
I found that instead of dealing with nifi client certificate hell, the nifi-toolkit cli.sh works just fine for testing the readiness of the cluster.  Here is my readiness script, which seems to work well in kubernetes with the apache/nifi docker container, version 1.12.1:



#!/bin/bash

$NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
if [ $? -ne 0 ]; then
        cat /tmp/cluster.state
        exit 1
fi

STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

if [[ ! $STATUS = "CONNECTED" ]]; then
        echo "Node not found with CONNECTED state. Full cluster state:"
        jq . /tmp/cluster.state
        exit 1
fi
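
If it helps anyone, a rough sketch of how a script like this could be wired into the StatefulSet pod spec is below; the script path, mount point, and probe timings are assumptions for illustration rather than values from my actual manifest.

readinessProbe:
  exec:
    command:
      - bash
      - /opt/nifi/scripts/readiness.sh   # hypothetical mount point for the script above
  initialDelaySeconds: 60                # allow time for startup and flow election
  periodSeconds: 30
  failureThreshold: 5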


________________________________
From: Chris Sampson <ch...@naimuri.com>
Sent: Thursday, October 1, 2020 9:03 AM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

For info, the probes we currently use for our StatefulSet Pods are:

  *   livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
  *   readinessProbe - exec command to curl the nifi-api/controller/cluster endpoint to check the node's cluster connection status, e.g.:

readinessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        if [ "${SECURE}" = "true" ]; then
          INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')

          curl -v \
            --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
            --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
            --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
            https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        else
          curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        fi

        STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)

        if [[ ! $STATUS = "CONNECTED" ]]; then
          echo "Node not found with CONNECTED state. Full cluster state:"
          jq . /tmp/cluster.state
          exit 1
        fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to call the endpoint and for whom our pod contains a set of certificate files in the indicated locations (generated from NiFi Toolkit in an init-container before the Pod starts); jq utility was added into our customised version of the apache/nifi Docker Image.


---
Chris Sampson
IT Consultant
chris.sampson@naimuri.com
<https://www.naimuri.com/>


On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <wy...@keepertech.com>> wrote:
Thanks for following up and filing the issue. Unfortunately, I don't have any of the logs from the original issue because I have since restarted and rebooted my containers many times.
________________________________
From: Mark Payne <ma...@hotmail.com>>
Sent: Wednesday, September 30, 2020 11:21 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Thanks Wyll,

I created a Jira [1] to address this. The NullPointer that you show in the stack trace will prevent the node from reconnecting to the cluster. Unfortunately, it’s a bug that needs to be addressed. It’s possible that you may find a way to work around the issue, but I can’t tell you off the top of my head what that would be.

Can you check the logs for anything else from the StandardFlowService class? That may help to understand why the null value is getting returned, causing the NullPointerException that you’re seeing.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7866

On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:

1.11.4
________________________________
From: Mark Payne <ma...@hotmail.com>>
Sent: Wednesday, September 30, 2020 11:02 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Wyll,

What version of nifi are you running?

Thanks
-Mark


On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:


  *   Yes - the host specific parameters on the different instances are configured correctly (nifi-0, nifi-1, nifi-2)
  *   Yes - we have separate certificate for each node and the keystores are configured correctly.
  *   Yes - we have a headless service in front of the STS cluster
  *   No - I don't think there is an explicit liveness or readiness probe defined for the STS, perhaps I need to add one. Do you have an example?

-Wyllys


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 3:21 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

We started to have more stability when we switched to bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.

Your properties have nifi-0 in several places, so just to double check that the relevant properties are changed for each of the instances within your statefulset?

For example:
* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host


Yes

And are you using a separate (non-wildcard) certificate for each node?


Do you have liveness/readiness probes set on your nifi sts?


And are you using a headless service[1] to manage the cluster during startup?


[1] https://kubernetes.io/docs/concepts/services-networking/service/#headless-services


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <wy...@keepertech.com>> wrote:
Zookeeper is from the docker hub zookeeper:3.5.7 image.

Below is our nifi.properties (with secrets and hostnames modified).

thanks!
 - Wyllys


nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
nifi.flow.configuration.archive.enabled=true
nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
nifi.flow.configuration.archive.max.time=30 days
nifi.flow.configuration.archive.max.storage=500 MB
nifi.flow.configuration.archive.max.count=
nifi.flowcontroller.autoResumeState=false
nifi.flowcontroller.graceful.shutdown.period=10 sec
nifi.flowservice.writedelay.interval=500 ms
nifi.administrative.yield.duration=30 sec

nifi.bored.yield.duration=10 millis
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB

nifi.authorizer.configuration.file=./conf/authorizers.xml
nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.templates.directory=/opt/nifi/nifi-current/templates
nifi.ui.banner.text=KI Nifi Cluster
nifi.ui.autorefresh.interval=30 sec
nifi.nar.library.directory=./lib
nifi.nar.library.autoload.directory=./extensions
nifi.nar.working.directory=./work/nar/
nifi.documentation.working.directory=./work/docs/components

nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties

nifi.database.directory=./database_repository
nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=./flowfile_repository
nifi.flowfile.repository.partitions=256
nifi.flowfile.repository.checkpoint.interval=2 mins
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.encryption.key.provider.implementation=
nifi.flowfile.repository.encryption.key.provider.location=
nifi.flowfile.repository.encryption.key.id=
nifi.flowfile.repository.encryption.key=

nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=

nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=

nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=7 days
nifi.provenance.repository.max.storage.size=100 GB
nifi.provenance.repository.rollover.time=120 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
nifi.provenance.repository.indexed.attributes=
nifi.provenance.repository.index.shard.size=4 GB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.buffer.size=100000

nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
nifi.components.status.repository.buffer.size=1440
nifi.components.status.snapshot.frequency=1 min

nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs

nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=
nifi.web.http.network.interface.default=
nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
nifi.web.https.port=8080
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=/nifi-api,/nifi
nifi.web.proxy.host=ingress.ourdomain.com

nifi.sensitive.props.key=
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC
nifi.sensitive.props.additional.keys=

nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
nifi.security.user.authorizer=managed-authorizer
nifi.security.user.login.identity.provider=
nifi.security.ocsp.responder.url=
nifi.security.ocsp.responder.certificate=

nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
nifi.security.user.oidc.connect.timeout=15 secs
nifi.security.user.oidc.read.timeout=15 secs
nifi.security.user.oidc.client.id=nifi
nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
nifi.security.user.oidc.preferred.jwsalgorithm=RS512
nifi.security.user.oidc.additional.scopes=
nifi.security.user.oidc.claim.identifying.user=

nifi.security.user.knox.url=
nifi.security.user.knox.publicKey=
nifi.security.user.knox.cookieName=hadoop-jwt
nifi.security.user.knox.audiences=

nifi.cluster.protocol.heartbeat.interval=30 secs
nifi.cluster.protocol.is.secure=true

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.node.protocol.port=2882
nifi.cluster.node.protocol.threads=40
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=120 secs
nifi.cluster.node.read.timeout=120 secs
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=

nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=

nifi.kerberos.krb5.file=

nifi.kerberos.service.principal=
nifi.kerberos.service.keytab.location=

nifi.kerberos.spnego.principal=
nifi.kerberos.spnego.keytab.location=
nifi.kerberos.spnego.authentication.expiration=12 hours

nifi.variable.registry.properties=

nifi.analytics.predict.enabled=false
nifi.analytics.predict.interval=3 mins
nifi.analytics.query.interval=5 mins
nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
nifi.analytics.connection.model.score.name=rSquared
nifi.analytics.connection.model.score.threshold=.90

________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 12:41 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Also, which version of zookeeper and what image (I've found different versions and images provided better stability)?


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <sk...@gmail.com>> wrote:
Hello Wyll

It may be helpful if you can send nifi.properties.

Thanks
Sushil Kumar

On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a StatefulSet) using external zookeeper (3 nodes also) to manage state.

Whenever even 1 node (pod/container) goes down or is restarted, it can throw the whole cluster into a bad state that forces me to restart ALL of the pods in order to recover.  This seems wrong.  The problem seems to be that when the primary node goes away, the remaining 2 nodes don't ever try to take over.  Instead, I have to restart all of them individually until one of them becomes the primary, then the other 2 eventually join and sync up.

When one of the nodes is refusing to sync up, I often see these errors in the log and the only way to get it back into the cluster is to restart it.  The node showing the errors below never seems to be able to rejoin or resync with the other 2 nodes.


2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered


Re: Clustered nifi issues

Posted by Chris Sampson <ch...@naimuri.com>.
For info, the probes we currently use for our StatefulSet Pods are:

   - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
   - readinessProbe - exec command to curl the nifi-api/controller/cluster
   endpoint to check the node's cluster connection status, e.g.:

> readinessProbe:
>   exec:
>     command:
>       - bash
>       - -c
>       - |
>         if [ "${SECURE}" = "true" ]; then
>           INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
>
>           curl -v \
>             --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
>             --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
>             --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
>             https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         else
>           curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         fi
>
>         STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
>
>         if [[ ! $STATUS = "CONNECTED" ]]; then
>           echo "Node not found with CONNECTED state. Full cluster state:"
>           jq . /tmp/cluster.state
>           exit 1
>         fi

Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to
call the endpoint and for whom our pod contains a set of certificate files
in the indicated locations (generated from NiFi Toolkit in an
init-container before the Pod starts); jq utility was added into our
customised version of the apache/nifi Docker Image.
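
For completeness, the tcpSocket livenessProbe mentioned in the first bullet above is just a plain port check; a minimal sketch (assuming the NiFi HTTPS port of 8080 used elsewhere in this thread, with illustrative timings) would look like:

livenessProbe:
  tcpSocket:
    port: 8080        # assumed to match nifi.web.https.port
  initialDelaySeconds: 60
  periodSeconds: 30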


---
*Chris Sampson*
IT Consultant
chris.sampson@naimuri.com
<https://www.naimuri.com/>


On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <
wyllys.ingersoll@keepertech.com> wrote:

> Thanks for following up and filing the issue. Unfortunately, I don't have
> any of the logs from the original issue because I have since restarted and
> rebooted my containers many times.
> ------------------------------
> *From:* Mark Payne <ma...@hotmail.com>
> *Sent:* Wednesday, September 30, 2020 11:21 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Thanks Wyll,
>
> I created a Jira [1] to address this. The NullPointer that you show in the
> stack trace will prevent the node from reconnecting to the cluster.
> Unfortunately, it’s a bug that needs to be addressed. It’s possible that
> you may find a way to work around the issue, but I can’t tell you off the
> top of my head what that would be.
>
> Can you check the logs for anything else from the StandardFlowService
> class? That may help to understand why the null value is getting returned,
> causing the NullPointerException that you’re seeing.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-7866
>
> On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
> 1.11.4
> ------------------------------
> *From:* Mark Payne <ma...@hotmail.com>
> *Sent:* Wednesday, September 30, 2020 11:02 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Wyll,
>
> What version of nifi are you running?
>
> Thanks
> -Mark
>
>
> On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
>    - Yes - the host specific parameters on the different instances are
>    configured correctly (nifi-0, nifi-1, nifi-2)
>    - Yes - we have separate certificate for each node and the keystores
>    are configured correctly.
>    - Yes - we have a headless service in front of the STS cluster
>    - No - I don't think there is an explicit liveness or readiness probe
>    defined for the STS, perhaps I need to add one. Do you have an example?
>
>
> -Wyllys
>
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Tuesday, September 29, 2020 3:21 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> We started to have more stability when we switched to
> bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.
>
> Your properties have nifi-0 in several places, so just to double check
> that the relevant properties are changed for each of the instances within
> your statefulset?
>
> For example:
> * nifi.remote.input.host
> * nifi.cluster.node.address
> * nifi.web.https.host
>
>
> Yes
>
> And are you using a separate (non-wildcard) certificate for each node?
>
>
> Do you have liveness/readiness probes set on your nifi sts?
>
>
> And are you using a headless service[1] to manage the cluster during
> startup?
>
>
> [1]
> https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
>
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <
> wyllys.ingersoll@keepertech.com> wrote:
>
> Zookeeper is from the docker hub zookeeper:3.5.7 image.
>
> Below is our nifi.properties (with secrets and hostnames modified).
>
> thanks!
>  - Wyllys
>
>
>
> nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
> nifi.flow.configuration.archive.enabled=true
> nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
> nifi.flow.configuration.archive.max.time=30 days
> nifi.flow.configuration.archive.max.storage=500 MB
> nifi.flow.configuration.archive.max.count=
> nifi.flowcontroller.autoResumeState=false
> nifi.flowcontroller.graceful.shutdown.period=10 sec
> nifi.flowservice.writedelay.interval=500 ms
> nifi.administrative.yield.duration=30 sec
>
> nifi.bored.yield.duration=10 millis
> nifi.queue.backpressure.count=10000
> nifi.queue.backpressure.size=1 GB
>
> nifi.authorizer.configuration.file=./conf/authorizers.xml
>
> nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
> nifi.templates.directory=/opt/nifi/nifi-current/templates
> nifi.ui.banner.text=KI Nifi Cluster
> nifi.ui.autorefresh.interval=30 sec
> nifi.nar.library.directory=./lib
> nifi.nar.library.autoload.directory=./extensions
> nifi.nar.working.directory=./work/nar/
> nifi.documentation.working.directory=./work/docs/components
>
> nifi.state.management.configuration.file=./conf/state-management.xml
> nifi.state.management.provider.local=local-provider
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=false
>
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
>
> nifi.database.directory=./database_repository
> nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
>
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
>
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=./flowfile_repository
> nifi.flowfile.repository.partitions=256
> nifi.flowfile.repository.checkpoint.interval=2 mins
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.encryption.key.provider.implementation=
> nifi.flowfile.repository.encryption.key.provider.location=
> nifi.flowfile.repository.encryption.key.id=
> nifi.flowfile.repository.encryption.key=
>
>
> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
> nifi.queue.swap.threshold=20000
> nifi.swap.in.period=5 sec
> nifi.swap.in.threads=1
> nifi.swap.out.period=5 sec
> nifi.swap.out.threads=4
>
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=1 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
> nifi.content.repository.encryption.key.provider.implementation=
> nifi.content.repository.encryption.key.provider.location=
> nifi.content.repository.encryption.key.id=
> nifi.content.repository.encryption.key=
>
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
> nifi.provenance.repository.debug.frequency=1_000_000
> nifi.provenance.repository.encryption.key.provider.implementation=
> nifi.provenance.repository.encryption.key.provider.location=
> nifi.provenance.repository.encryption.key.id=
> nifi.provenance.repository.encryption.key=
>
> nifi.provenance.repository.directory.default=./provenance_repository
> nifi.provenance.repository.max.storage.time=7 days
> nifi.provenance.repository.max.storage.size=100 GB
> nifi.provenance.repository.rollover.time=120 secs
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=2
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID,
> Filename, ProcessorID, Relationship
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=4 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=2
> nifi.provenance.repository.buffer.size=100000
>
>
> nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
> nifi.components.status.repository.buffer.size=1440
> nifi.components.status.snapshot.frequency=1 min
>
> nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.remote.input.secure=true
> nifi.remote.input.socket.port=10000
> nifi.remote.input.http.enabled=true
> nifi.remote.input.http.transaction.ttl=30 sec
> nifi.remote.contents.cache.expiration=30 secs
>
> nifi.web.war.directory=./lib
> nifi.web.http.host=
> nifi.web.http.port=
> nifi.web.http.network.interface.default=
> nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.web.https.port=8080
> nifi.web.https.network.interface.default=
> nifi.web.jetty.working.directory=./work/jetty
> nifi.web.jetty.threads=200
> nifi.web.max.header.size=16 KB
> nifi.web.proxy.context.path=/nifi-api,/nifi
> nifi.web.proxy.host=ingress.ourdomain.com
>
> nifi.sensitive.props.key=
> nifi.sensitive.props.key.protected=
> nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
> nifi.sensitive.props.provider=BC
> nifi.sensitive.props.additional.keys=
>
> nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
> nifi.security.keystoreType=jks
> nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
> nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
>
> nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
> nifi.security.truststoreType=jks
> nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.authorizer=managed-authorizer
> nifi.security.user.login.identity.provider=
> nifi.security.ocsp.responder.url=
> nifi.security.ocsp.responder.certificate=
>
> nifi.security.user.oidc.discovery.url=
> https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
> nifi.security.user.oidc.connect.timeout=15 secs
> nifi.security.user.oidc.read.timeout=15 secs
> nifi.security.user.oidc.client.id=nifi
> nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.oidc.preferred.jwsalgorithm=RS512
> nifi.security.user.oidc.additional.scopes=
> nifi.security.user.oidc.claim.identifying.user=
>
> nifi.security.user.knox.url=
> nifi.security.user.knox.publicKey=
> nifi.security.user.knox.cookieName=hadoop-jwt
> nifi.security.user.knox.audiences=
>
> nifi.cluster.protocol.heartbeat.interval=30 secs
> nifi.cluster.protocol.is.secure=true
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.node.protocol.port=2882
> nifi.cluster.node.protocol.threads=40
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=120 secs
> nifi.cluster.node.read.timeout=120 secs
> nifi.cluster.node.max.concurrent.requests=100
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=
>
> nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.load.balance.port=6342
> nifi.cluster.load.balance.connections.per.node=4
> nifi.cluster.load.balance.max.thread.count=8
> nifi.cluster.load.balance.comms.timeout=30 sec
>
>
> nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
> nifi.zookeeper.root.node=/nifi
> nifi.zookeeper.auth.type=
> nifi.zookeeper.kerberos.removeHostFromPrincipal=
> nifi.zookeeper.kerberos.removeRealmFromPrincipal=
>
> nifi.kerberos.krb5.file=
>
> nifi.kerberos.service.principal=
> nifi.kerberos.service.keytab.location=
>
> nifi.kerberos.spnego.principal=
> nifi.kerberos.spnego.keytab.location=
> nifi.kerberos.spnego.authentication.expiration=12 hours
>
> nifi.variable.registry.properties=
>
> nifi.analytics.predict.enabled=false
> nifi.analytics.predict.interval=3 mins
> nifi.analytics.query.interval=5 mins
>
> nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
> nifi.analytics.connection.model.score.name=rSquared
> nifi.analytics.connection.model.score.threshold=.90
>
> ------------------------------
> *From:* Chris Sampson <ch...@naimuri.com>
> *Sent:* Tuesday, September 29, 2020 12:41 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Clustered nifi issues
>
> Also, which version of zookeeper and what image (I've found different
> versions and images provided better stability)?
>
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <sk...@gmail.com> wrote:
>
> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <
> wyllys.ingersoll@keepertech.com> wrote:
>
>
> I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a
> StatefulSet) using external zookeeper (3 nodes also) to manage state.
>
> Whenever even 1 node (pod/container) goes down or is restarted, it can
> throw the whole cluster into a bad state that forces me to restart ALL of
> the pods in order to recover.  This seems wrong.  The problem seems to be
> that when the primary node goes away, the remaining 2 nodes don't ever try
> to take over.  Instead, I have to restart all of them individually until one
> of them becomes the primary, then the other 2 eventually join and sync up.
>
> When one of the nodes is refusing to sync up, I often see these errors in
> the log and the only way to get it back into the cluster is to restart it.
> The node showing the errors below never seems to be able to rejoin or
> resync with the other 2 nodes.
>
>
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster]
> o.a.nifi.controller.StandardFlowService Handling reconnection request
> failed due to: org.apache.nifi.cluster.ConnectionException: Failed to
> connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to
> cluster due to: java.lang.NullPointerException
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
> at
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
> at
> org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
> at
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
> ... 4 common frames omitted
> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster]
> o.a.c.f.imps.CuratorFrameworkImpl Starting
> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster]
> o.a.c.f.imps.CuratorFrameworkImpl Default schema
> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread]
> o.a.c.f.state.ConnectionStateManager State change: CONNECTED
> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread]
> o.a.c.framework.imps.EnsembleTracker New config event received:
> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181, version=0,
> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181,
> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181}
> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0]
> o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread]
> o.a.c.framework.imps.EnsembleTracker New config event received:
> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181, version=0,
> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181,
> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
> 0.0.0.0:2181}
> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster]
> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
> Role 'Primary Node' becuase that role is not registered
> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster]
> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
> Role 'Cluster Coordinator' becuase that role is not registered
>
>
>

Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
Thanks for following up and filing the issue. Unfortunately, I don't have any of the logs from the original issue because I have since restarted and rebooted my containers many times.
________________________________
From: Mark Payne <ma...@hotmail.com>
Sent: Wednesday, September 30, 2020 11:21 AM
To: users@nifi.apache.org <us...@nifi.apache.org>
Subject: Re: Clustered nifi issues

Thanks Wyll,

I created a Jira [1] to address this. The NullPointer that you show in the stack trace will prevent the node from reconnecting to the cluster. Unfortunately, it’s a bug that needs to be addressed. It’s possible that you may find a way to work around the issue, but I can’t tell you off the top of my head what that would be.

Can you check the logs for anything else from the StandardFlowService class? That may help to understand why the null value is getting returned, causing the NullPointerException that you’re seeing.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7866

On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:

1.11.4
________________________________
From: Mark Payne <ma...@hotmail.com>>
Sent: Wednesday, September 30, 2020 11:02 AM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Wyll,

What version of nifi are you running?

Thanks
-Mark


On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <wy...@keepertech.com>> wrote:


  *   Yes - the host specific parameters on the different instances are configured correctly (nifi-0, nifi-1, nifi-2)
  *   Yes - we have separate certificate for each node and the keystores are configured correctly.
  *   Yes - we have a headless service in front of the STS cluster
  *   No - I don't think there is an explicit liveness or readiness probe defined for the STS, perhaps I need to add one. Do you have an example?

-Wyllys


________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 3:21 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

We started to have more stability when we switched to bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.

Your properties have nifi-0 in several places, so just to double check that the relevant properties are changed for each of the instances within your statefulset?

For example:
* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host


Yes

And are you using a separate (non-wildcard) certificate for each node?


Do you have liveness/readiness probes set on your nifi sts?


And are you using a headless service[1] to manage the cluster during startup?


[1] https://kubernetes.io/docs/concepts/services-networking/service/#headless-services


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <wy...@keepertech.com>> wrote:
Zookeeper is from the docker hub zookeeper:3.5.7 image.

Below is our nifi.properties (with secrets and hostnames modified).

thanks!
 - Wyllys


nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
nifi.flow.configuration.archive.enabled=true
nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
nifi.flow.configuration.archive.max.time=30 days
nifi.flow.configuration.archive.max.storage=500 MB
nifi.flow.configuration.archive.max.count=
nifi.flowcontroller.autoResumeState=false
nifi.flowcontroller.graceful.shutdown.period=10 sec
nifi.flowservice.writedelay.interval=500 ms
nifi.administrative.yield.duration=30 sec

nifi.bored.yield.duration=10 millis
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB

nifi.authorizer.configuration.file=./conf/authorizers.xml
nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.templates.directory=/opt/nifi/nifi-current/templates
nifi.ui.banner.text=KI Nifi Cluster
nifi.ui.autorefresh.interval=30 sec
nifi.nar.library.directory=./lib
nifi.nar.library.autoload.directory=./extensions
nifi.nar.working.directory=./work/nar/
nifi.documentation.working.directory=./work/docs/components

nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties

nifi.database.directory=./database_repository
nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=./flowfile_repository
nifi.flowfile.repository.partitions=256
nifi.flowfile.repository.checkpoint.interval=2 mins
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.encryption.key.provider.implementation=
nifi.flowfile.repository.encryption.key.provider.location=
nifi.flowfile.repository.encryption.key.id=
nifi.flowfile.repository.encryption.key=

nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=

nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=

nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=7 days
nifi.provenance.repository.max.storage.size=100 GB
nifi.provenance.repository.rollover.time=120 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
nifi.provenance.repository.indexed.attributes=
nifi.provenance.repository.index.shard.size=4 GB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.buffer.size=100000

nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
nifi.components.status.repository.buffer.size=1440
nifi.components.status.snapshot.frequency=1 min

nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs

nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=
nifi.web.http.network.interface.default=
nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
nifi.web.https.port=8080
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=/nifi-api,/nifi
nifi.web.proxy.host=ingress.ourdomain.com

nifi.sensitive.props.key=
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC
nifi.sensitive.props.additional.keys=

nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
nifi.security.user.authorizer=managed-authorizer
nifi.security.user.login.identity.provider=
nifi.security.ocsp.responder.url=
nifi.security.ocsp.responder.certificate=

nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
nifi.security.user.oidc.connect.timeout=15 secs
nifi.security.user.oidc.read.timeout=15 secs
nifi.security.user.oidc.client.id=nifi
nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
nifi.security.user.oidc.preferred.jwsalgorithm=RS512
nifi.security.user.oidc.additional.scopes=
nifi.security.user.oidc.claim.identifying.user=

nifi.security.user.knox.url=
nifi.security.user.knox.publicKey=
nifi.security.user.knox.cookieName=hadoop-jwt
nifi.security.user.knox.audiences=

nifi.cluster.protocol.heartbeat.interval=30 secs
nifi.cluster.protocol.is.secure=true

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.node.protocol.port=2882
nifi.cluster.node.protocol.threads=40
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=120 secs
nifi.cluster.node.read.timeout=120 secs
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=

nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=

nifi.kerberos.krb5.file=

nifi.kerberos.service.principal=
nifi.kerberos.service.keytab.location=

nifi.kerberos.spnego.principal=
nifi.kerberos.spnego.keytab.location=
nifi.kerberos.spnego.authentication.expiration=12 hours

nifi.variable.registry.properties=

nifi.analytics.predict.enabled=false
nifi.analytics.predict.interval=3 mins
nifi.analytics.query.interval=5 mins
nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
nifi.analytics.connection.model.score.name=rSquared
nifi.analytics.connection.model.score.threshold=.90

________________________________
From: Chris Sampson <ch...@naimuri.com>>
Sent: Tuesday, September 29, 2020 12:41 PM
To: users@nifi.apache.org<ma...@nifi.apache.org> <us...@nifi.apache.org>>
Subject: Re: Clustered nifi issues

Also, which version of zookeeper and what image (I've found different versions and images provided better stability)?


Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <sk...@gmail.com>> wrote:
Hello Wyll

It may be helpful if you can send nifi.properties.

Thanks
Sushil Kumar

On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <wy...@keepertech.com>> wrote:

I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a StatefulSet) using external zookeeper (3 nodes also) to manage state.

Whenever even 1 node (pod/container) goes down or is restarted, it can throw the whole cluster into a bad state that forces me to restart ALL of the pods in order to recover.  This seems wrong.  The problem seems to be that when the primary node goes away, the remaining 2 nodes don't ever try to take over.  Instead, I have to restart all of them individually until one of them becomes the primary, then the other 2 eventually join and sync up.

When one of the nodes is refusing to sync up, I often see these errors in the log and the only way to get it back into the cluster is to restart it.  The node showing the errors below never seems to be able to rejoin or resync with the other 2 nodes.


2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181<http://0.0.0.0:2181/>}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered


Re: Clustered nifi issues

Posted by Mark Payne <ma...@hotmail.com>.
Thanks Wyll,

I created a Jira [1] to address this. The NullPointer that you show in the stack trace will prevent the node from reconnecting to the cluster. Unfortunately, it’s a bug that needs to be addressed. It’s possible that you may find a way to work around the issue, but I can’t tell you off the top of my head what that would be.

Can you check the logs for anything else from the StandardFlowService class? That may help to understand why the null value is getting returned, causing the NullPointerException that you’re seeing.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7866
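
One low-risk way to surface that extra detail, assuming the stock conf/logback.xml layout, is a dedicated logger entry for the class named in the stack trace; NiFi's default logback configuration rescans the file periodically, so a restart is usually not needed. The level below is illustrative only:

<!-- sketch only: add inside <configuration> in conf/logback.xml -->
<logger name="org.apache.nifi.controller.StandardFlowService" level="DEBUG"/>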



Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
1.11.4




Re: Clustered nifi issues

Posted by Mark Payne <ma...@hotmail.com>.
Wyll,

What version of nifi are you running?

Thanks
-Mark






Re: Clustered nifi issues

Posted by Wyll Ingersoll <wy...@keepertech.com>.
  *   Yes - the host-specific parameters on the different instances are configured correctly (nifi-0, nifi-1, nifi-2)
  *   Yes - we have a separate certificate for each node and the keystores are configured correctly.
  *   Yes - we have a headless service in front of the STS cluster
  *   No - I don't think there is an explicit liveness or readiness probe defined for the STS; perhaps I need to add one. Do you have an example? (A minimal sketch follows below.)

-Wyllys
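
A minimal sketch of what liveness/readiness probes on the NiFi StatefulSet could look like, assuming the pods expose the HTTPS port 8080 from the nifi.properties shown earlier in the thread; the probe timings are illustrative only, and NiFi's slow startup usually needs a generous initial delay (or a startupProbe on newer Kubernetes):

# sketch only - slots into the StatefulSet's pod template (spec.template.spec.containers)
# TCP checks avoid the TLS/auth complications an HTTPS httpGet probe can hit on a secured NiFi
containers:
  - name: nifi
    livenessProbe:
      tcpSocket:
        port: 8080            # nifi.web.https.port
      initialDelaySeconds: 180
      periodSeconds: 30
      failureThreshold: 4
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 3

A failing liveness probe restarts only that pod, and an unready pod is dropped from the headless service's DNS until it recovers; some StatefulSet setups also set publishNotReadyAddresses: true on the headless service so the NiFi nodes can still resolve each other while they are starting up.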




Re: Clustered nifi issues

Posted by Chris Sampson <ch...@naimuri.com>.
We started to have more stability when we switched to
bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.

Your properties have nifi-0 in several places, so just to double check: are the
relevant properties changed for each of the instances within your statefulset
(see the sketch below)?

For example:
* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host

And are you using a separate (non-wildcard) certificate for each node?
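
For illustration, a minimal sketch of the properties that would typically carry
node-specific values on the second instance. The nifi-1 values here are
hypothetical, assuming the pods follow the same nifi-<ordinal> hostname pattern
and each has its own keystore:

# per-node values for the second pod (illustrative only)
nifi.remote.input.host=nifi-1.nifi.ki.svc.cluster.local
nifi.web.https.host=nifi-1.nifi.ki.svc.cluster.local
nifi.cluster.node.address=nifi-1.nifi.ki.svc.cluster.local
nifi.cluster.load.balance.host=nifi-1.nifi.ki.svc.cluster.local
nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-1.keystore.jks
nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-1.truststore.jks

(The keystore/truststore lines only matter if each node really does have its
own certificate, per the question above.)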


Do you have liveness/readiness probes set on your nifi sts?


And are you using a headless service[1] to manage the cluster during
startup?


[1]
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services


Cheers,

Chris Sampson




Re: Clustered nifi issues

Posted by Chris Sampson <ch...@naimuri.com>.
Also, which version of zookeeper and what image (I've found that some versions
and images provide better stability than others)?


Cheers,

Chris Sampson


Re: Clustered nifi issues

Posted by Sushil Kumar <sk...@gmail.com>.
Hello Wyll

It may be helpful if you can send nifi.properties.

Thanks
Sushil Kumar
