Posted to users@nifi.apache.org by Maksym Skrynnikov <sk...@verizonmedia.com> on 2021/01/27 13:05:06 UTC

NiFi queues become imbalanced

Hello!

I am digging into an issue that appeared after migrating from NiFi 1.9 to NiFi 1.12:
the cluster becomes imbalanced over time and we have to intervene manually
to rebalance flow files again.
[image: nifi.jpg]
This is the setup I have, and the *matched* connection is set to do *Round
Robin* load balancing.
The remaining 3 processors are supposed to run on the same node to avoid file
transfers over the network.
The problem is that after I manually interfere and tasks are re-balanced, nodes
start to do an even amount of work, but over time it starts to diverge:
some nodes do much less work than others, while other nodes accumulate
very large queues, which causes processing delays. Looking at the
charts, I notice that after some time there is this "dropping effect" where the
amount of tasks per node suddenly drops.
[image: chart.jpg]
Does anyone know what's going on here? Why do queues pile up
on some nodes (different ones every time) and become imbalanced? And how can I
prevent this and keep the load balanced across the cluster? At the moment I have
some nodes struggling while other nodes do nothing.
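
In case it matters, the cluster-side behaviour of Round Robin balancing is
governed by a handful of nifi.properties entries; a minimal sketch with what I
believe are the 1.12 defaults (property names quoted from memory, so please
double-check against the admin guide for your version):

  nifi.cluster.load.balance.host=
  nifi.cluster.load.balance.port=6342
  nifi.cluster.load.balance.connections.per.node=4
  nifi.cluster.load.balance.max.thread.count=8
  nifi.cluster.load.balance.comms.timeout=30 sec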

Thanks

Re: [E] Re: NiFi queues become imbalanced

Posted by Joe Witt <jo...@gmail.com>.
Maksym,

If NiFi relaunched, then the likely culprit is an out-of-memory
scenario.  This almost always comes down to flow design (avoid turning
content into large attributes, avoid using processors which indicate
they're memory intensive without understanding what that means for your
data, etc.).  We'd have to be able to see the whole flow.  Have you had a
chance to review Mark Payne's flow design YouTube videos?
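
A quick way to confirm that theory is to look for heap errors in the app log
and to check what heap the bootstrap is configured with. Roughly, assuming a
default install layout (paths and property indices may differ on yours):

  # look for heap exhaustion in the application log
  grep -i "OutOfMemoryError" logs/nifi-app.log*

  # restarts should also be visible in logs/nifi-bootstrap.log

  # heap limits live in conf/bootstrap.conf, for example:
  # java.arg.2=-Xms8g
  # java.arg.3=-Xmx8g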

Thanks

Re: [E] Re: NiFi queues become imbalanced

Posted by Maksym Skrynnikov <sk...@verizonmedia.com>.
Joe Witt,

I have now reduced the number of threads per processor and, as you say, "less
is more" here: I can see the same or slightly better performance with a
much lower number of threads. I have also identified a relaunch of the NiFi
service in the logs, so I am focusing now on eliminating that relaunch issue so
that I have a clearer view. Backpressure is enabled and sometimes it starts to
kick in when a node is underperforming. What bothers me is that while one
node with a long queue is struggling to process all the tasks, the rest of the
nodes are doing nothing. So I am digging into the balancing issue at the moment.

Thank you

Re: [E] Re: NiFi queues become imbalanced

Posted by Joe Witt <jo...@gmail.com>.
Maksym

Yeah, frankly it should be smooth as butter.  A cluster that large and well
provisioned should be extremely fast.  These rates, while relatively high,
seem quite small for a cluster of that power.

So, do you use any of the back pressure mechanisms?  Load balancing the
connections is wise, but are you setting any back pressure values on the
connections?  If the data fetched from S3 can vary in size, then this
is even more important.  All nodes, assuming they're of similar capability,
should remain rock solidly working against some rate of events per second
and/or overall volume.  Are you using the demarcator value on PutKafka to ensure
each message is constructed by detecting lines within the files?  Another
angle here is to ensure Kafka itself is doing well.
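
On the back pressure point: object count and size thresholds are set per
connection, and the defaults that newly created connections pick up come from
nifi.properties. A minimal sketch of those defaults, assuming I recall the
property names correctly for this version:

  nifi.queue.backpressure.count=10000
  nifi.queue.backpressure.size=1 GB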

You have 25 nodes total and I see over 200 threads trying to hit Kafka at once,
which suggests you have 8-10 threads per node writing to Kafka?  That also
should not be necessary; less should be more there, frankly.  What type of
ACK and related settings are configured there?
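
To make the ACK and batching question concrete, outside NiFi a line-by-line
publish with an explicit acks setting looks roughly like the sketch below using
kafka-python. This is illustrative only, not what the processor does
internally, and the broker, topic, and tuning values are made up:

  # illustrative sketch: publish a large file to Kafka one line per message
  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers="broker1:9092",
      acks=1,                # or "all" for the strongest delivery guarantee
      batch_size=64 * 1024,  # batch many small line-sized records together
      linger_ms=50,          # let batches fill briefly before sending
  )

  with open("large_file.txt", "rb") as f:
      for line in f:
          producer.send("my-topic", line.rstrip(b"\n"))

  producer.flush()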

Again, we'd need to understand a lot more detail about the overall picture
and the settings at play here, but what you're attempting to do should
certainly be achievable/reliable/stable.

Thanks

Re: [E] Re: NiFi queues become imbalanced

Posted by Maksym Skrynnikov <sk...@verizonmedia.com>.
Joe,

Kafka: the files can be quite large (300-400 MB). The publisher sends the
contents of the file line by line, so each line of the content is pushed as
a separate message, but as far as I understand the publisher processor does
this in batches.
The NiFi cluster is 25 nodes running on AWS, each with 36 cores and a 200 GB EBS volume.
GC metrics do not look out of the ordinary.

What I notice is that there are peak times in the data: during some
hours NiFi has to process more data, some nodes become unable to
keep up with the pace, and their queues keep growing; usually the cluster
coordinator is among those nodes as well. What bothers me is that no rebalancing
happens and the bad nodes with big queues become even worse. If I stop
the processor, change the connection to no load balancing and save, wait 10
seconds, and change it back to Round Robin, all nodes get an equal amount of
work and the queue drains quite fast, but over time there are again
busy nodes with large queues while others do nothing.

Thank you.

Re: NiFi queues become imbalanced

Posted by Joe Witt <jo...@gmail.com>.
Hello

There are likely a lot more details needed to fully appreciate all that is
going on here.

For sending to Kafka, are you sending the entire flowfile content as a
single message?

Have you looked at GC performance on the nodes to see if there is a
correlation?
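
If you want a quick look without wiring up full monitoring, sampling the NiFi
JVM with jstat shows GC utilization and collection counts over time (the PID
below is a placeholder):

  # sample GC stats from the NiFi JVM every 5 seconds
  jstat -gcutil <nifi_pid> 5000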

This is likely something that will require a more dynamic/interactive approach
to get to the bottom of.  These appear to be pretty high-end rates on
a large NiFi cluster.  What version?  What type of underlying
infrastructure?

Thanks
