Posted to users@nifi.apache.org by Jeremy Farbota <jf...@payoff.com> on 2017/09/12 17:38:23 UTC

NiFi cluster sluggish...fine after VMs rebooted

Hello,

We've recently been having an issue where flow files get queued up and sit
in the flows, seemingly at random times, as if the whole system were under
back-pressure. A few weeks back, we had a problem where the provenance repo
was not keeping up with the creation of flow files. We switched to a
VolatileProvenanceRepo and that issue was resolved.[1]
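For anyone wanting to reproduce that change, the switch lives in
nifi.properties. A minimal sketch, assuming a stock NiFi 1.x install
(property names and defaults should be verified against your version):

```properties
# nifi.properties: swap the persistent provenance repository for the
# in-memory (volatile) implementation
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
# cap on provenance events held in memory (default is 100000; the
# exact property name may vary by NiFi version)
nifi.provenance.repository.buffer.size=100000
```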


Since then, the system will occasionally slow down without any errors
or warnings. It seems that NiFi is unable to move flow files through the
system after they are created. I know it is happening when I see the total
number of flow files spike and a few things queued in unusual places (e.g.
ReplaceText, UpdateAttribute, etc., where you usually never see anything
queued). Eventually I end up with a huge queue on the last step, which
usually involves a controller (PutSQL or PutHDFS).


We've found that this issue is resolved by rebooting all of the machines.
Is it possible that the volatile provenance repo is somehow still falling
behind, and that rebooting flushes it, thus removing the back-pressure?
We're trying to figure out why the reboot helps. We're also trying to
understand more about the different situations where NiFi internally slows
down flows.


We see some routine errors in the logs (e.g. expired security credentials,
minor errors due to a test flow that has issues), but nothing that suggests
the system is applying back-pressure.


At this point, I'm going to create a reporting task that alerts DevOps
when files are heavily queued, so we know when it's happening.


Thanks in advance.


[1]
http://apache-nifi.1125220.n5.nabble.com/Rate-of-the-data-flow-is-exceeding-the-provenance-recording-rate-Slowing-down-flow-to-accomodate-td9213.html



*Jeremy Farbota*
Software Engineer, Data
Payoff, Inc.

jfarbota@payoff.com

Re: NiFi cluster sluggish...fine after VMs rebooted

Posted by Jeremy Farbota <jf...@payoff.com>.
Yes, these VMs all have New Relic installed.

Whenever I have a memory issue, the node seems to go down. We've got a lot
of heap dedicated to these nodes and we're not seeing the memory spike.

When this is happening, we might have 50k flow files system-wide. The thing
is, I've run flows before with over a million flow files after a SplitText
and did not have this issue.

The dedicated heap is 26 GB for each of these nodes. Usage is typically
around 3 GB; it will spike if we get a huge dump from Kafka, but I rarely
see it go over 12 GB.
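For anyone comparing setups: heap sizes like that are set in
conf/bootstrap.conf. A sketch assuming a default install (the `java.arg.N`
indices are just labels and vary between installs):

```properties
# conf/bootstrap.conf: JVM heap settings for a 26 GB heap
java.arg.2=-Xms26g
java.arg.3=-Xmx26g
# with heaps this large, a low-pause collector is often worth testing
# java.arg.13=-XX:+UseG1GC
```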

*Jeremy Farbota*
Software Engineer, Data
Payoff, Inc.

jfarbota@payoff.com
(217) 898-8110

On Tue, Sep 12, 2017 at 10:51 AM, Joe Witt <jo...@gmail.com> wrote:

> Jeremy
>
> This sounds then like it is probably a memory pressure issue.  Are you
> monitoring memory usage/GC behavior?
>
> Do you have SplitText being used quite a bit or MergeContent by chance?
> It isn't the size of the content but rather the number of flowfiles in
> memory that can cause sluggish behavior.  Easily worked around, but let's
> learn a bit more about the current state.
>
> Thanks
>

Re: NiFi cluster sluggish...fine after VMs rebooted

Posted by Joe Witt <jo...@gmail.com>.
Jeremy

This sounds then like it is probably a memory pressure issue.  Are you
monitoring memory usage/GC behavior?

Do you have SplitText being used quite a bit or MergeContent by chance?  It
isn't the size of the content but rather the number of flowfiles in memory
that can cause sluggish behavior.  Easily worked around, but let's learn a
bit more about the current state.
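For context on the usual workaround (this is not spelled out in the thread,
so treat it as an assumption): rather than one SplitText emitting every
line as its own flow file, people chain two SplitText processors so that no
single session holds the full fan-out in memory. A rough sketch of the
arithmetic:

```python
import math

def staged_split_counts(total_lines: int, chunk: int):
    """For a two-stage split, return (flow files emitted by the first
    SplitText, maximum flow files a single second-stage session emits).

    A single-stage split of `total_lines` lines creates `total_lines`
    flow files in one session; staging bounds each session at `chunk`.
    """
    first_stage = math.ceil(total_lines / chunk)
    return first_stage, chunk

# Splitting 1,000,000 lines with an intermediate chunk of 10,000:
# the first stage emits 100 flow files, and each second-stage session
# emits at most 10,000 instead of 1,000,000 at once.
first, per_session = staged_split_counts(1_000_000, 10_000)
```

The same bounding idea applies on the MergeContent side, merging in stages
rather than binning an enormous number of flow files at once.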

Thanks
