Posted to users@nifi.apache.org by Martijn Dekkers <ma...@dekkers.org.uk> on 2018/11/06 15:38:26 UTC

Re: [EXT] Re: Cluster Warnings

Did you ever get to the bottom of this Karthik? I have seen a similar
issue, and it turned out to be related to hardware not keeping up with the
increased volume of work (which is why a node was added to begin with).

On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
> Any light on this? We’re still seeing warnings every 10 seconds, which
> is really annoying, and we have no idea what’s triggering them.
>
> -Karthik
>
> From: Karthik Kothareddy (karthikk) [CONT - Type 2]
> Sent: Tuesday, October 16, 2018 8:27 AM
> To: users@nifi.apache.org
> Subject: RE: [EXT] Re: Cluster Warnings
>
> Below is some of the nifi-app.log when DEBUG mode is enabled.
>
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):
> NIFI04:8443: 2 millis
> NIFI02:8443: 5 millis
> NIFI01:8443: 224 millis
> NIFI03:8443: 5 millis
>
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:
> Verify Cluster State for All Nodes took 0 millis
> Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis
> Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis
> Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis
> Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis
> Perform HTTP Request for NIFI04:8443 took 2 millis
> Perform HTTP Request for NIFI03:8443 took 5 millis
> Perform HTTP Request for NIFI02:8443 took 5 millis
> Perform HTTP Request for NIFI01:8443 took 224 millis
> Phase Completed for All Nodes took 224 millis
> Notifying All Threads that Request is Complete took 0 millis
> Total Time for All Nodes took 224 millis
> Map/Merge Responses for All Nodes took 0 millis
>
> -Karthik
>
> From: Joe Witt [mailto:joe.witt@gmail.com]
> Sent: Tuesday, October 16, 2018 8:03 AM
> To: users@nifi.apache.org
> Subject: Re: [EXT] Re: Cluster Warnings
>
> karthik
>
> understood.  do you have those logs?
>
> On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>> Joe,
>>
>> The slow node is Node04 in this case, but we get one such slow
>> response from a random node (Node01, Node02, Node03) every time we see
>> this warning.
>>
>> -Karthik
>>
>> From: Joe Witt [mailto:joe.witt@gmail.com]
>> Sent: Tuesday, October 16, 2018 7:55 AM
>> To: users@nifi.apache.org
>> Subject: [EXT] Re: Cluster Warnings
>>
>> the logs show the fourth node is the slowest by far in all cases.
>> possibly a dns or other related issue?  but def focus on that node as
>> the outlier and presuming nifi config is identical it suggests
>> system/network differences from other nodes.
>>
>> thanks
>>
>> On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>>> Hello,
>>>
>>> We’re running a 4-node cluster on NiFi 1.7.1. The fourth node was
>>> added recently, and as soon as we added the 4th node we started
>>> seeing the warnings below:
>>>
>>> Response time from NODE2 was slow for each of the last 3 requests
>>> made. To see more information about timing, enable DEBUG logging for
>>> org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>>>
>>> Initially we thought the problem was with the recently added node and
>>> cross-checked all the configs on the box, and everything seemed to be
>>> just fine. After enabling DEBUG mode for cluster logging we
>>> noticed that the warning is not specific to any node, and every time
>>> we see a warning like the above there is one slow node which takes
>>> forever to send a response, like below (in this case the slow node is
>>> NIFI04). Sometimes these will lead to node disconnects needing
>>> manual intervention.
>>>
>>> DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):
>>> NIFI04:8443: 1386 millis
>>> NIFI02:8443: 3 millis
>>> NIFI01:8443: 5 millis
>>> NIFI03:8443: 3 millis
>>> DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):
>>> NIFI04:8443: 1143 millis
>>> NIFI02:8443: 22 millis
>>> NIFI01:8443: 3 millis
>>> NIFI03:8443: 2 millis
>>> DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):
>>> NIFI04:8443: 1053 millis
>>> NIFI02:8443: 3 millis
>>> NIFI01:8443: 3 millis
>>> NIFI03:8443: 2 millis
>>>
>>> We tried changing the configurations in nifi.properties, like bumping
>>> up “nifi.cluster.node.protocol.max.threads”, but none of them
>>> seems to be working and we’re still stuck with the slow
>>> communication between the nodes. We use an external ZooKeeper as
>>> this is our production server.
>>> Below are some of our configs:
>>>
>>> # cluster node properties (only configure for cluster nodes) #
>>> nifi.cluster.is.node=true
>>> nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com
>>> nifi.cluster.node.protocol.port=11443
>>> nifi.cluster.node.protocol.threads=100
>>> nifi.cluster.node.protocol.max.threads=120
>>> nifi.cluster.node.event.history.size=25
>>> nifi.cluster.node.connection.timeout=90 sec
>>> nifi.cluster.node.read.timeout=90 sec
>>> nifi.cluster.node.max.concurrent.requests=1000
>>> nifi.cluster.firewall.file=
>>> nifi.cluster.flow.election.max.wait.time=30 sec
>>> nifi.cluster.flow.election.max.candidates=
>>>
>>> Any thoughts on why this is happening?
>>>
>>> -Karthik




RE: [EXT] Re: Cluster Warnings

Posted by "Karthik Kothareddy (karthikk) [CONT - Type 2]" <ka...@micron.com>.
Joe, we haven’t moved to 1.8.0 yet, as we have to deploy on lower environments (stand-alone) and then move up the channel. With that being said, it’s going to take at least two weeks for us to get 1.8.0 to our clusters. In the meantime we will cross-check all our environments and get back to you if we find anything unusual.

Martijn, we do run on Linux (RHEL 6) and I will talk to our administration team to see if we can deploy netdata to find any disk/network issues.

Thanks both of you

-Karthik
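While netdata approval is pending, a rough interim check is to sample host metrics with a small script and line the samples up against the slow spells in nifi-app.log. The sketch below is only illustrative, not part of NiFi or netdata: it assumes Python 3 and the psutil package on each node, and the one-second interval, metric selection, and output format are arbitrary choices.

# sample_host_metrics.py - very rough stand-in for netdata: samples CPU iowait
# plus disk and network counters once per second so slow spells in nifi-app.log
# can be correlated with host activity. Assumes Python 3 and psutil on the node.
import time
import psutil

INTERVAL = 1  # seconds between samples

def main():
    prev_disk = psutil.disk_io_counters()
    prev_net = psutil.net_io_counters()
    while True:
        cpu = psutil.cpu_times_percent(interval=INTERVAL)  # blocks for INTERVAL seconds
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        print("%s iowait=%.1f%% disk_r=%dKB/s disk_w=%dKB/s net_rx=%dKB/s net_tx=%dKB/s" % (
            time.strftime("%Y-%m-%d %H:%M:%S"),
            getattr(cpu, "iowait", 0.0),  # iowait is only reported on Linux
            (disk.read_bytes - prev_disk.read_bytes) // 1024 // INTERVAL,
            (disk.write_bytes - prev_disk.write_bytes) // 1024 // INTERVAL,
            (net.bytes_recv - prev_net.bytes_recv) // 1024 // INTERVAL,
            (net.bytes_sent - prev_net.bytes_sent) // 1024 // INTERVAL))
        prev_disk, prev_net = disk, net

if __name__ == "__main__":
    main()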


From: Martijn Dekkers [mailto:martijn@dekkers.org.uk]
Sent: Tuesday, November 06, 2018 9:03 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: Cluster Warnings

In my case it was a combined disk/network issue.

If you run on Linux I can strongly recommend you have a look at netdata which is a 5 minute deploy (and clean remove) and will give you a great way to correlate various metrics without any drama.

Martijn


On Tue, 6 Nov 2018, at 16:53, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:

Hello Martijn,



No... we’re still seeing these warnings very frequently and sometimes node disconnects as well. We tried everything we could and were not able to get to the bottom of this. When you say “related to hardware”, can you please point me to the specific issue, as we double-checked all our boxes for any issues and everything seemed normal.



-Karthik
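Since the warnings sometimes escalate into node disconnects, it can also help to timestamp those disconnects independently of the UI so they can be matched against host metrics. The sketch below is a rough illustration rather than an established tool: it polls the NiFi 1.x REST endpoint /nifi-api/controller/cluster and prints a line whenever a node is not CONNECTED; the base URL, poll interval, and TLS/auth handling are placeholders to adapt to the cluster's security setup.

# poll_cluster_status.py - rough sketch: poll the NiFi REST API and print a line
# whenever any node is not CONNECTED, so disconnects get an exact timestamp.
# Assumes Python 3 + requests and that /nifi-api/controller/cluster is reachable
# with whatever auth the cluster needs (client cert or token) - plug that in below.
import time
import requests

NIFI_URL = "https://NIFI01:8443"   # placeholder: any reachable node in the cluster
POLL_SECONDS = 10

session = requests.Session()
session.verify = False             # or the path to the CA bundle for the cluster
# session.cert = ("client.pem", "client.key")   # e.g. client-certificate auth

while True:
    try:
        resp = session.get(NIFI_URL + "/nifi-api/controller/cluster", timeout=30)
        resp.raise_for_status()
        # assumed response layout: {"cluster": {"nodes": [{"address": ..., "status": ...}, ...]}}
        for node in resp.json()["cluster"]["nodes"]:
            if node.get("status") != "CONNECTED":
                print("%s %s:%s status=%s" % (time.strftime("%H:%M:%S"),
                                              node.get("address"),
                                              node.get("apiPort"),
                                              node.get("status")))
    except Exception as exc:       # broad on purpose: this is only a watcher
        print("%s poll failed: %s" % (time.strftime("%H:%M:%S"), exc))
    time.sleep(POLL_SECONDS)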



From: Martijn Dekkers [mailto:martijn@dekkers.org.uk]
Sent: Tuesday, November 06, 2018 8:38 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: Cluster Warnings



Did you ever get to the bottom of this Karthik? I have seen similar issue, and turned out to be related to hardware not keeping up with the increased volume of work (which is why a node was added to begin with)





On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:

Any light on this? We’re still seeing warnings every 10 seconds which is really annoying and no idea what’s triggering them.



-Karthik


From: Karthik Kothareddy (karthikk) [CONT - Type 2]
Sent: Tuesday, October 16, 2018 8:27 AM
To: users@nifi.apache.org
Subject: RE: [EXT] Re: Cluster Warnings



Below is some of the nifi-app.log when DEBUG mode is enabled.



2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site

2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443

2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524

2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms

2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):

NIFI04:8443: 2 millis

NIFI02:8443: 5 millis

NIFI01:8443: 224 millis

NIFI03:8443: 5 millis



2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:

Verify Cluster State for All Nodes took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis

Perform HTTP Request for NIFI04:8443 took 2 millis

Perform HTTP Request for NIFI03:8443 took 5 millis

Perform HTTP Request for NIFI02:8443 took 5 millis

Perform HTTP Request for NIFI01:8443 took 224 millis

Phase Completed for All Nodes took 224 millis

Notifying All Threads that Request is Complete took 0 millis

Total Time for All Nodes took 224 millis

Map/Merge Responses for All Nodes took 0 millis



-Karthik


From: Joe Witt [mailto:joe.witt@gmail.com]
Sent: Tuesday, October 16, 2018 8:03 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: Cluster Warnings



karthik



understood.  do you have those logs?







On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:

Joe,



The slow node is Node04 in this case, but we get one such slow response from a random node (Node01, Node02, Node03) every time we see this warning.



-Karthik
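One way to check whether the slow responses really do rotate across nodes, as described above, is to tally the per-node timing lines that ThreadPoolRequestReplicator emits at DEBUG. A rough sketch only (assumes Python 3, the multi-line DEBUG format shown in the log excerpts in this thread, and an arbitrary 500 ms threshold):

# slow_node_tally.py - rough sketch: scan nifi-app.log for the per-node timing
# lines that ThreadPoolRequestReplicator logs at DEBUG (e.g. "NIFI04:8443: 1386
# millis") and count, per node, how often a response exceeded a threshold.
# Helps confirm whether slow responses rotate across nodes or cluster on one host.
import re
import sys
from collections import Counter, defaultdict

LOG_FILE = sys.argv[1] if len(sys.argv) > 1 else "nifi-app.log"
SLOW_MS = 500

line_re = re.compile(r"^(?P<node>\S+:\d+): (?P<ms>\d+) millis")

totals = Counter()
slow = Counter()
worst = defaultdict(int)

with open(LOG_FILE, errors="replace") as fh:
    for line in fh:
        m = line_re.match(line)
        if not m:
            continue
        node, ms = m.group("node"), int(m.group("ms"))
        totals[node] += 1
        if ms >= SLOW_MS:
            slow[node] += 1
        worst[node] = max(worst[node], ms)

for node in sorted(totals):
    print("%-20s requests=%-6d slow(>=%dms)=%-5d worst=%dms"
          % (node, totals[node], SLOW_MS, slow[node], worst[node]))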


From: Joe Witt [mailto:joe.witt@gmail.com]
Sent: Tuesday, October 16, 2018 7:55 AM
To: users@nifi.apache.org
Subject: [EXT] Re: Cluster Warnings



the logs show the fourth node is the slowest by far in all cases.  possibly a dns or other related issue?  but def focus on that node as the outlier and presuming nifi config is identical it suggests system/network differences from other nodes.



thanks
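To follow up on the DNS angle above, a small probe run from each node against every other node can separate name-resolution time from the HTTP round trip that the replicator measures. A hedged sketch only (assumes Python 3 with the requests package; hostnames, port, and TLS handling are placeholders, and on a secured cluster the GET will likely come back 401 without credentials, though the timing is still informative):

# probe_nodes.py - rough sketch: time DNS resolution and a plain HTTPS GET of
# /nifi-api/site-to-site against every node, run from each node in turn, to see
# whether the extra latency is name resolution, network, or the NiFi instance.
import socket
import time
import requests

NODES = ["NIFI01", "NIFI02", "NIFI03", "NIFI04"]   # placeholders for the cluster hosts
PORT = 8443

session = requests.Session()
session.verify = False          # or the CA bundle / client cert the cluster requires

for host in NODES:
    t0 = time.time()
    try:
        socket.getaddrinfo(host, PORT)
        dns_ms = (time.time() - t0) * 1000
    except socket.gaierror as exc:
        print("%-8s DNS FAILED: %s" % (host, exc))
        continue

    t1 = time.time()
    try:
        resp = session.get("https://%s:%d/nifi-api/site-to-site" % (host, PORT), timeout=30)
        http_ms = (time.time() - t1) * 1000
        print("%-8s dns=%.1fms http=%.1fms status=%d" % (host, dns_ms, http_ms, resp.status_code))
    except requests.RequestException as exc:
        print("%-8s dns=%.1fms http FAILED: %s" % (host, dns_ms, exc))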



On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:



Hello,



We’re running a 4-node cluster on NiFi 1.7.1. The fourth node was added recently, and as soon as we added the 4th node we started seeing the warnings below:



Response time from NODE2 was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator



Initially we thought the problem was with the recently added node and cross-checked all the configs on the box, and everything seemed to be just fine. After enabling DEBUG mode for cluster logging we noticed that the warning is not specific to any node, and every time we see a warning like the above there is one slow node which takes forever to send a response, like below (in this case the slow node is NIFI04). Sometimes these will lead to node disconnects needing manual intervention.



DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):

NIFI04:8443: 1386 millis

NIFI02:8443: 3 millis

NIFI01:8443: 5 millis

NIFI03:8443: 3 millis

DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):

NIFI04:8443: 1143 millis

NIFI02:8443: 22 millis

NIFI01:8443: 3 millis

NIFI03:8443: 2 millis

DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):

NIFI04:8443: 1053 millis

NIFI02:8443: 3 millis

NIFI01:8443: 3 millis

NIFI03:8443: 2 millis



We tried changing the configurations in nifi.properties, like bumping up “nifi.cluster.node.protocol.max.threads”, but none of them seems to be working and we’re still stuck with the slow communication between the nodes. We use an external ZooKeeper as this is our production server.

Below are some of our configs



# cluster node properties (only configure for cluster nodes) #

nifi.cluster.is.node=true

nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com

nifi.cluster.node.protocol.port=11443

nifi.cluster.node.protocol.threads=100

nifi.cluster.node.protocol.max.threads=120

nifi.cluster.node.event.history.size=25

nifi.cluster.node.connection.timeout=90 sec

nifi.cluster.node.read.timeout=90 sec

nifi.cluster.node.max.concurrent.requests=1000

nifi.cluster.firewall.file=

nifi.cluster.flow.election.max.wait.time=30 sec

nifi.cluster.flow.election.max.candidates=



Any thoughts on why this is happening?





-Karthik




Re: [EXT] Re: Cluster Warnings

Posted by Martijn Dekkers <ma...@dekkers.org.uk>.
In my case it was a combined disk/network issue.

If you run on Linux I can strongly recommend you have a look at netdata
which is a 5 minute deploy (and clean remove) and will give you a great
way to correlate various metrics without any drama.
Martijn


On Tue, 6 Nov 2018, at 16:53, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
> Hello Martijn,
>
> No... we’re still seeing these warnings very frequently and sometimes
> node disconnects as well. We tried everything we could and were not
> able to get to the bottom of this. When you say “related to hardware”,
> can you please point me to the specific issue, as we double-checked all
> our boxes for any issues and everything seemed normal.
>
> -Karthik
>
> From: Martijn Dekkers [mailto:martijn@dekkers.org.uk]
> Sent: Tuesday, November 06, 2018 8:38 AM
> To: users@nifi.apache.org
> Subject: Re: [EXT] Re: Cluster Warnings
>
> Did you ever get to the bottom of this Karthik? I have seen a similar
> issue, and it turned out to be related to hardware not keeping up with
> the increased volume of work (which is why a node was added to
> begin with).
>
> On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
>> Any light on this? We’re still seeing warnings every 10 seconds, which
>> is really annoying, and we have no idea what’s triggering them.
>>
>> -Karthik
>>
>> From: Karthik Kothareddy (karthikk) [CONT - Type 2]
>> Sent: Tuesday, October 16, 2018 8:27 AM
>> To: users@nifi.apache.org
>> Subject: RE: [EXT] Re: Cluster Warnings
>>
>> Below is some of the nifi-app.log when DEBUG mode is enabled.
>>
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):
>> NIFI04:8443: 2 millis
>> NIFI02:8443: 5 millis
>> NIFI01:8443: 224 millis
>> NIFI03:8443: 5 millis
>>
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:
>> Verify Cluster State for All Nodes took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis
>> Perform HTTP Request for NIFI04:8443 took 2 millis
>> Perform HTTP Request for NIFI03:8443 took 5 millis
>> Perform HTTP Request for NIFI02:8443 took 5 millis
>> Perform HTTP Request for NIFI01:8443 took 224 millis
>> Phase Completed for All Nodes took 224 millis
>> Notifying All Threads that Request is Complete took 0 millis
>> Total Time for All Nodes took 224 millis
>> Map/Merge Responses for All Nodes took 0 millis
>>
>> -Karthik
>>
>> From: Joe Witt [mailto:joe.witt@gmail.com]
>> Sent: Tuesday, October 16, 2018 8:03 AM
>> To: users@nifi.apache.org
>> Subject: Re: [EXT] Re: Cluster Warnings
>>
>> karthik
>>
>> understood.  do you have those logs?
>>
>> On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>>> Joe,
>>>
>>> The slow node is Node04 in this case, but we get one such slow
>>> response from a random node (Node01, Node02, Node03) every time we see
>>> this warning.
>>>
>>> -Karthik
>>>
>>> From: Joe Witt [mailto:joe.witt@gmail.com]
>>> Sent: Tuesday, October 16, 2018 7:55 AM
>>> To: users@nifi.apache.org
>>> Subject: [EXT] Re: Cluster Warnings
>>>
>>> the logs show the fourth node is the slowest by far in all cases.
>>> possibly a dns or other related issue?  but def focus on that node
>>> as the outlier and presuming nifi config is identical it suggests
>>> system/network differences from other nodes.
>>>
>>> thanks
>>>
>>> On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>>>> Hello,
>>>>
>>>> We’re running a 4-node cluster on NiFi 1.7.1. The fourth node was
>>>> added recently, and as soon as we added the 4th node we started
>>>> seeing the warnings below:
>>>>
>>>> Response time from NODE2 was slow for each of the last 3 requests
>>>> made. To see more information about timing, enable DEBUG logging
>>>> for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>>>>
>>>> Initially we thought the problem was with the recently added node and
>>>> cross-checked all the configs on the box, and everything seemed to
>>>> be just fine. After enabling DEBUG mode for cluster logging we
>>>> noticed that the warning is not specific to any node, and every time
>>>> we see a warning like the above there is one slow node which takes
>>>> forever to send a response, like below (in this case the slow node
>>>> is NIFI04). Sometimes these will lead to node disconnects needing
>>>> manual intervention.
>>>>
>>>> DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):
>>>> NIFI04:8443: 1386 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 5 millis
>>>> NIFI03:8443: 3 millis
>>>> DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):
>>>> NIFI04:8443: 1143 millis
>>>> NIFI02:8443: 22 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>> DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):
>>>> NIFI04:8443: 1053 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>>
>>>> We tried changing the configurations in nifi.properties, like
>>>> bumping up “nifi.cluster.node.protocol.max.threads”, but none of
>>>> them seems to be working and we’re still stuck with the slow
>>>> communication between the nodes. We use an external ZooKeeper as
>>>> this is our production server.
>>>> Below are some of our configs:
>>>>
>>>> # cluster node properties (only configure for cluster nodes) #
>>>> nifi.cluster.is.node=true
>>>> nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com
>>>> nifi.cluster.node.protocol.port=11443
>>>> nifi.cluster.node.protocol.threads=100
>>>> nifi.cluster.node.protocol.max.threads=120
>>>> nifi.cluster.node.event.history.size=25
>>>> nifi.cluster.node.connection.timeout=90 sec
>>>> nifi.cluster.node.read.timeout=90 sec
>>>> nifi.cluster.node.max.concurrent.requests=1000
>>>> nifi.cluster.firewall.file=
>>>> nifi.cluster.flow.election.max.wait.time=30 sec
>>>> nifi.cluster.flow.election.max.candidates=
>>>>
>>>> Any thoughts on why this is happening?
>>>>
>>>> -Karthik




Re: [EXT] Re: Cluster Warnings

Posted by Joe Witt <jo...@gmail.com>.
Karthik

Have you given Apache NiFi 1.8.0 a go yet?  I'm not sure it will
resolve anything but again it is really hard to know what you're
seeing.  That said, we know clusters at large scale and under heavy
load are usually quite stable, so it's either environmental/DNS/etc., or
it's something in NiFi we've not seen.

Thanks
On Tue, Nov 6, 2018 at 10:54 AM Karthik Kothareddy (karthikk) [CONT -
Type 2] <ka...@micron.com> wrote:
>
> Hello Martijn,
>
>
>
> No... we’re still seeing these warnings very frequently and sometimes node disconnects as well. We tried everything we could and were not able to get to the bottom of this. When you say “related to hardware”, can you please point me to the specific issue, as we double-checked all our boxes for any issues and everything seemed normal.
>
>
>
> -Karthik
>
>
>
> From: Martijn Dekkers [mailto:martijn@dekkers.org.uk]
> Sent: Tuesday, November 06, 2018 8:38 AM
> To: users@nifi.apache.org
> Subject: Re: [EXT] Re: Cluster Warnings
>
>
>
> Did you ever get to the bottom of this Karthik? I have seen similar issue, and turned out to be related to hardware not keeping up with the increased volume of work (which is why a node was added to begin with)
>
>
>
>
>
> On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
>
> Any light on this? We’re still seeing warnings every 10 seconds which is really annoying and no idea what’s triggering them.
>
>
>
> -Karthik
>
>
>
> From: Karthik Kothareddy (karthikk) [CONT - Type 2]
> Sent: Tuesday, October 16, 2018 8:27 AM
> To: users@nifi.apache.org
> Subject: RE: [EXT] Re: Cluster Warnings
>
>
>
> Below is some of the nifi-app.log when DEBUG mode is enabled.
>
>
>
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site
>
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443
>
> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms
>
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):
>
> NIFI04:8443: 2 millis
>
> NIFI02:8443: 5 millis
>
> NIFI01:8443: 224 millis
>
> NIFI03:8443: 5 millis
>
>
>
> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed
>
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site
>
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>
> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:
>
> Verify Cluster State for All Nodes took 0 millis
>
> Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis
>
> Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis
>
> Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis
>
> Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis
>
> Perform HTTP Request for NIFI04:8443 took 2 millis
>
> Perform HTTP Request for NIFI03:8443 took 5 millis
>
> Perform HTTP Request for NIFI02:8443 took 5 millis
>
> Perform HTTP Request for NIFI01:8443 took 224 millis
>
> Phase Completed for All Nodes took 224 millis
>
> Notifying All Threads that Request is Complete took 0 millis
>
> Total Time for All Nodes took 224 millis
>
> Map/Merge Responses for All Nodes took 0 millis
>
>
>
> -Karthik
>
>
>
> From: Joe Witt [mailto:joe.witt@gmail.com]
> Sent: Tuesday, October 16, 2018 8:03 AM
> To: users@nifi.apache.org
> Subject: Re: [EXT] Re: Cluster Warnings
>
>
>
> karthik
>
>
>
> understood.  do you have those logs?
>
>
>
>
>
>
>
> On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>
> Joe,
>
>
>
> The slow node is Node04 in this case but we get one such slow response from a random node(Node01, Node02,Node03) every time we see this warning.
>
>
>
> -Karthik
>
>
>
> From: Joe Witt [mailto:joe.witt@gmail.com]
> Sent: Tuesday, October 16, 2018 7:55 AM
> To: users@nifi.apache.org
> Subject: [EXT] Re: Cluster Warnings
>
>
>
> the logs show the fourth node is the slowest by far in all cases.  possibly a dns or other related issue?  but def focus on that node as the outlier and presuming nifi config is identical it suggests system/network differences from other nodes.
>
>
>
> thanks
>
>
>
> On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:
>
>
>
> Hello,
>
>
>
> We’re running a 4-node cluster on NiFi 1.7.1. The fourth node was added recently and as soon as we added the 4th node, we started seeing below warnings
>
>
>
> Response time from NODE2 was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>
>
>
> Initially we thought the problem was with the recently added node and cross-checked all the configs on the box, and everything seemed to be just fine. After enabling DEBUG mode for cluster logging we noticed that the warning is not specific to any node, and every time we see a warning like the above there is one slow node which takes forever to send a response, like below (in this case the slow node is NIFI04). Sometimes these will lead to node disconnects needing manual intervention.
>
>
>
> DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):
>
> NIFI04:8443: 1386 millis
>
> NIFI02:8443: 3 millis
>
> NIFI01:8443: 5 millis
>
> NIFI03:8443: 3 millis
>
> DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):
>
> NIFI04:8443: 1143 millis
>
> NIFI02:8443: 22 millis
>
> NIFI01:8443: 3 millis
>
> NIFI03:8443: 2 millis
>
> DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):
>
> NIFI04:8443: 1053 millis
>
> NIFI02:8443: 3 millis
>
> NIFI01:8443: 3 millis
>
> NIFI03:8443: 2 millis
>
>
>
> We tried changing the configurations in nifi.properties like bumping up the “nifi.cluster.node.protocol.max.threads” but none of them seems to be working and we’re still stuck with the slow communication between the nodes. We use an external zookeeper as this is our production server.
>
> Below are some of our configs
>
>
>
> # cluster node properties (only configure for cluster nodes) #
>
> nifi.cluster.is.node=true
>
> nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com
>
> nifi.cluster.node.protocol.port=11443
>
> nifi.cluster.node.protocol.threads=100
>
> nifi.cluster.node.protocol.max.threads=120
>
> nifi.cluster.node.event.history.size=25
>
> nifi.cluster.node.connection.timeout=90 sec
>
> nifi.cluster.node.read.timeout=90 sec
>
> nifi.cluster.node.max.concurrent.requests=1000
>
> nifi.cluster.firewall.file=
>
> nifi.cluster.flow.election.max.wait.time=30 sec
>
> nifi.cluster.flow.election.max.candidates=
>
>
>
> Any thoughts on why this is happening?
>
>
>
>
>
> -Karthik
>
>

RE: [EXT] Re: Cluster Warnings

Posted by "Karthik Kothareddy (karthikk) [CONT - Type 2]" <ka...@micron.com>.
Hello Martijn,

No... we’re still seeing these warnings very frequently and sometimes node disconnects as well. We tried everything we could and were not able to get to the bottom of this. When you say “related to hardware”, can you please point me to the specific issue, as we double-checked all our boxes for any issues and everything seemed normal.

-Karthik

From: Martijn Dekkers [mailto:martijn@dekkers.org.uk]
Sent: Tuesday, November 06, 2018 8:38 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: Cluster Warnings

Did you ever get to the bottom of this Karthik? I have seen similar issue, and turned out to be related to hardware not keeping up with the increased volume of work (which is why a node was added to begin with)


On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:

Any light on this? We’re still seeing warnings every 10 seconds which is really annoying and no idea what’s triggering them.



-Karthik



From: Karthik Kothareddy (karthikk) [CONT - Type 2]
Sent: Tuesday, October 16, 2018 8:27 AM
To: users@nifi.apache.org
Subject: RE: [EXT] Re: Cluster Warnings



Below is some of the nifi-app.log when DEBUG mode is enabled.



2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site

2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443

2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524

2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms

2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):

NIFI04:8443: 2 millis

NIFI02:8443: 5 millis

NIFI01:8443: 224 millis

NIFI03:8443: 5 millis



2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524

2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:

Verify Cluster State for All Nodes took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis

Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis

Perform HTTP Request for NIFI04:8443 took 2 millis

Perform HTTP Request for NIFI03:8443 took 5 millis

Perform HTTP Request for NIFI02:8443 took 5 millis

Perform HTTP Request for NIFI01:8443 took 224 millis

Phase Completed for All Nodes took 224 millis

Notifying All Threads that Request is Complete took 0 millis

Total Time for All Nodes took 224 millis

Map/Merge Responses for All Nodes took 0 millis



-Karthik



From: Joe Witt [mailto:joe.witt@gmail.com]
Sent: Tuesday, October 16, 2018 8:03 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: Cluster Warnings



karthik



understood.  do you have those logs?







On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:

Joe,



The slow node is Node04 in this case but we get one such slow response from a random node(Node01, Node02,Node03) every time we see this warning.



-Karthik



From: Joe Witt [mailto:joe.witt@gmail.com]
Sent: Tuesday, October 16, 2018 7:55 AM
To: users@nifi.apache.org
Subject: [EXT] Re: Cluster Warnings



the logs show the fourth node is the slowest by far in all cases.  possibly a dns or other related issue?  but def focus on that node as the outlier and presuming nifi config is identical it suggests system/network differences from other nodes.



thanks



On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <ka...@micron.com> wrote:



Hello,



We’re running a 4-node cluster on NiFi 1.7.1. The fourth node was added recently and as soon as we added the 4th node, we started seeing below warnings



Response time from NODE2 was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator



Initially we thought the problem was with the recently added node and cross-checked all the configs on the box, and everything seemed to be just fine. After enabling DEBUG mode for cluster logging we noticed that the warning is not specific to any node, and every time we see a warning like the above there is one slow node which takes forever to send a response, like below (in this case the slow node is NIFI04). Sometimes these will lead to node disconnects needing manual intervention.



DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):

NIFI04:8443: 1386 millis

NIFI02:8443: 3 millis

NIFI01:8443: 5 millis

NIFI03:8443: 3 millis

DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):

NIFI04:8443: 1143 millis

NIFI02:8443: 22 millis

NIFI01:8443: 3 millis

NIFI03:8443: 2 millis

DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):

NIFI04:8443: 1053 millis

NIFI02:8443: 3 millis

NIFI01:8443: 3 millis

NIFI03:8443: 2 millis



We tried changing the configurations in nifi.properties like bumping up the “nifi.cluster.node.protocol.max.threads” but none of them seems to be working and we’re still stuck with the slow communication between the nodes. We use an external zookeeper as this is our production server.

Below are some of our configs



# cluster node properties (only configure for cluster nodes) #

nifi.cluster.is.node=true

nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com

nifi.cluster.node.protocol.port=11443

nifi.cluster.node.protocol.threads=100

nifi.cluster.node.protocol.max.threads=120

nifi.cluster.node.event.history.size=25

nifi.cluster.node.connection.timeout=90 sec

nifi.cluster.node.read.timeout=90 sec

nifi.cluster.node.max.concurrent.requests=1000

nifi.cluster.firewall.file=

nifi.cluster.flow.election.max.wait.time=30 sec

nifi.cluster.flow.election.max.candidates=



Any thoughts on why this is happening?





-Karthik