Posted to issues@nifi.apache.org by "Frederik Petersen (JIRA)" <ji...@apache.org> on 2018/05/22 14:47:00 UTC

[jira] [Comment Edited] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

    [ https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484072#comment-16484072 ] 

Frederik Petersen edited comment on NIFI-5225 at 5/22/18 2:46 PM:
------------------------------------------------------------------

[~joewitt] + [~markap14] thanks!

_Did you verify this addressed your case successfully?_ Yes, we are already running a patched 1.5.0 version on our production systems, and they no longer show the original issue.

_Are you in a position to try your usage and provide analysis on the latest apache master?_ We are currently running HDF-3.1.0.0 and I am not sure we want to fiddle with it to use the latest master right now. We would need to change our development environment to replicate production more closely, and I don't think we currently have the time for that. But I am very intrigued by the fixed issues (5112 + 5136), as we are currently seeing high latency for web requests.

Something I also noticed while looking into this leak is that SecondPrecisionEventContainer.generateReport() takes up a relatively large amount of time even when the cluster has just been started. Many important resources (like createConnection/Ports/Processor) call FlowController.getGroupStatus, which in turn calls generateReport for all processors/connections. When we instantiate templates or create processors/connections using the API, this is done many times per component. I think this is quite a waste of resources (and VisualVM sampling confirms it: close to 100% of the sampled web threads spend their time in the generateReport method). I don't even understand why these stats are extracted when a component is created; it's probably an oversight. And even for the resources that do need to supply these stats for the UI, I think it would be good if we could set a flag when using the API to say that we are not interested in these stats at all. Just some thoughts I had while reading through the code today.
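To make that flag idea concrete, here is a purely hypothetical sketch (none of these names exist in NiFi; "includeStatus" is an imagined hint and StatusWalker just stands in for the FlowController.getGroupStatus/generateReport path mentioned above):

{code:java}
// Hypothetical illustration only, not NiFi code: skip the expensive status walk
// when an API caller creating a component has no use for the stats.
class CreateComponentHandler {

    /** Stand-in for the FlowController.getGroupStatus() -> generateReport() path. */
    interface StatusWalker {
        Object walkGroupStatus(String groupId); // expensive: visits every processor/connection
    }

    private final StatusWalker statusWalker;

    CreateComponentHandler(final StatusWalker statusWalker) {
        this.statusWalker = statusWalker;
    }

    Object createComponent(final String groupId, final boolean includeStatus) {
        // ... create the processor/port/connection here ...
        // Only pay for the status report if the caller actually asked for it.
        return includeStatus ? statusWalker.walkGroupStatus(groupId) : null;
    }
}
{code}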

I think these issues 'hit' us quite hard because we are running NiFi on 8 machines and have over a thousand processors in the flow. We have already thought about splitting the flow up because of these issues, but with the patch for this one I think we can keep moving forward and hope that future releases make everything smoother.



> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
>                 Key: NIFI-5225
>                 URL: https://issues.apache.org/jira/browse/NIFI-5225
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>         Environment: HDF-3.1.0.0
>            Reporter: Frederik Petersen
>            Priority: Major
>              Labels: performance
>
> We use NiFi's API to change a part of our flow quite frequently. Over the past weeks we have noticed that the performance of web requests degrades over time, and we had a very hard time finding out why.
> Today I took a closer look. When using VisualVM to sample CPU usage, it already stood out that the longer the cluster had been running, the more time was spent in 'SecondPrecisionEventContainer.generateReport()' during web requests. This method is relied on a lot right after starting the cluster (for big flows and process groups), but the time spent in it increases (in our setup) the longer the cluster runs. This increases the latency of almost every web request. Our flow reconfiguration script (which calls many NiFi API endpoints) went from 2 minutes to 20 minutes of run time within a few days.
>  Looking at the source code I couldn't quite figure out why the run time should increase over time, because the ring buffers always stay the same size (301 entries|5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than there should have been. So I took a heap dump and ran it through the Memory Analyzer tool. The "Leak Suspects" overview gave me the final hint about what was wrong.
>  It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance is referenced by org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @ 0x7f86c50cda40 , loaded by "org.apache.nifi.nar.NarClassLoader @ 0x7f86a0000000". The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".
> The issue is:
> When we remove processors, connections, or process groups from the flow, their data is not removed from the ConcurrentHashMap in RingBufferEventRepository. There is a 'purgeTransferEvents' method, but it only calls an empty 'purgeEvents' method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bounds, and every time 'reportTransferEvents' is called it iterates over all entries of the map (more and more over time). This increases the latency of every web request and also leaves a huge amount of memory occupied.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group, connection, ?...) using their onRemoved methods in the FlowController.
> This should stop the map from growing indefinitely for any flow where components are removed frequently, especially when automated.
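> To make the leak and the fix idea concrete, here is a minimal, self-contained sketch (illustrative names only, not the actual NiFi classes): entries are added per component and iterated on every report, so unless a removed component's entry is explicitly evicted, the map only ever grows:
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.LongAdder;
>
> // Toy model of the repository map described above (not NiFi code).
> class ToyEventRepository {
>     private final ConcurrentHashMap<String, LongAdder> countsByComponentId = new ConcurrentHashMap<>();
>
>     void addEvent(final String componentId, final long bytesTransferred) {
>         countsByComponentId.computeIfAbsent(componentId, id -> new LongAdder()).add(bytesTransferred);
>     }
>
>     long reportTransferEvents() {
>         // Iterates over every entry ever created -- without eviction this includes
>         // components that were deleted long ago, so reports get slower over time.
>         return countsByComponentId.values().stream().mapToLong(LongAdder::sum).sum();
>     }
>
>     // The proposed fix: called from the component's onRemoved handling so the
>     // entry disappears together with the component and the map stays bounded.
>     void purgeEventsFor(final String removedComponentId) {
>         countsByComponentId.remove(removedComponentId);
>     }
> }
> {code}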
> Since this is quite urgent for us, I'll try to work on a fix for this and provide a pull request if successful.
> Since no one noticed this before, I guess we are not the typical NiFi user: we thought it was possible to heavily reconfigure flows using the API, but with this performance issue it's not.
> Please let me know if I can provide any more helpful detail for this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)