Posted to issues@nifi.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/22 13:33:00 UTC

[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

    [ https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483946#comment-16483946 ] 

ASF GitHub Bot commented on NIFI-5225:
--------------------------------------

GitHub user FrederikP opened a pull request:

    https://github.com/apache/nifi/pull/2732

    NIFI-5225: Purge event data from event repository when Connectable is removed

    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [x] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    _Clean install ran through just fine, but contrib-check complained about an unrelated package_
    - [x] Have you written or updated unit tests to verify your changes?
    - [ ] ~~If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?~~ 
    - [ ] ~~If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?~~
    - [ ] ~~If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?~~
    - [ ] ~~If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?~~
    
    ### For documentation related changes:
    - ~~[ ] Have you ensured that format looks appropriate for the output in which it is rendered?~~
    
    I introduced the option to purge data from the FlowFileEventRepository (the 5 min ring buffer) to fix this:
    https://issues.apache.org/jira/browse/NIFI-5225
    
    And it works for our setup.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/FrederikP/nifi master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2732.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2732
    
----
commit 4e5a118305c9513cca239c136c48239c501e9907
Author: Frederik Petersen <fp...@...>
Date:   2018-05-22T10:55:59Z

    NIFI-5225: Purge event data from event repository when Connectable is removed

----


> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
>                 Key: NIFI-5225
>                 URL: https://issues.apache.org/jira/browse/NIFI-5225
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.5.0, 1.6.0
>         Environment: HDF-3.1.0.0
>            Reporter: Frederik Petersen
>            Priority: Major
>              Labels: performance
>
> We use NiFi's API to change a part of our flow quite frequently. Over the past weeks we have noticed that the performance of web requests degrades over time, and we had a very hard time finding out why.
> Today I took a closer look. When sampling CPU with VisualVM, it stood out that the longer the cluster had been running, the more time was spent in 'SecondPrecisionEventContainer.generateReport()' during web requests. This method is already relied on heavily right after starting the cluster (for big flows and process groups), but the time spent in it increases (in our setup) the longer the cluster runs. This increases the latency of almost every web request. The run time of our flow reconfiguration script (which calls many NiFi API endpoints) went from 2 minutes to 20 minutes within a few days.
> Looking at the source code, I couldn't quite figure out why the run time should increase over time, because the ring buffers always stay the same size (301 entries / 5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than there should have been. So I took a heap dump and ran a MemoryAnalyzer tool on it. The "Leak Suspects" overview gave me the final hint as to what was wrong.
> It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance is referenced by org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @ 0x7f86c50cda40 , loaded by "org.apache.nifi.nar.NarClassLoader @ 0x7f86a0000000". The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".
> The issue is:
> When we remove processors, connections, or process groups from the flow, their data is not removed from the ConcurrentHashMap in RingBufferEventRepository. There is a 'purgeTransferEvents' method, but it only calls an empty 'purgeEvents' method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bound, and every time 'reportTransferEvents' is called it iterates over all entries of the map (more and more of them over time). This increases the latency of every web request and also occupies a huge amount of memory.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group, connection, ?...) using their onRemoved methods in the FlowController.
> This should stop the map from growing indefinitely for any flow where components are removed frequently, especially when the removals are automated.
> Since this is quite urgent for us, I'll try to work on a fix for this and provide a pull request if successful.
> Since no one noticed this before, I guess we are not typical NiFi users: we thought it was possible to heavily reconfigure flows using the API, but with this performance issue it is not.
> Please let me know if I can provide any more helpful details for this problem.
>  
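
The leak pattern described in the quoted issue, and the proposed remedy of dropping a component's entry when the component is removed, can be illustrated with a minimal sketch. This is not NiFi code: the class EventRepositorySketch, the nested SecondRingBuffer, and all method names are hypothetical stand-ins. Only the 301-slot buffer size and the use of a ConcurrentHashMap keyed by component are taken from the issue text. Each per-component buffer stays a fixed size, but the map that holds the buffers grows with every component ever seen unless entries are removed explicitly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only. None of these types are NiFi classes; all names are hypothetical.
class EventRepositorySketch {

    // Fixed-size buffer with one slot per second, standing in for a per-component container.
    static final class SecondRingBuffer {
        private final long[] sums;
        private final long[] slotSecond; // the second each slot currently holds

        SecondRingBuffer(int slots) {
            sums = new long[slots];
            slotSecond = new long[slots];
            Arrays.fill(slotSecond, Long.MIN_VALUE); // marks a slot as never written
        }

        synchronized void add(long epochSecond, long value) {
            int idx = (int) Math.floorMod(epochSecond, (long) sums.length);
            if (slotSecond[idx] != epochSecond) {
                // The slot still holds data from an earlier lap around the ring: reset it.
                slotSecond[idx] = epochSecond;
                sums[idx] = 0;
            }
            sums[idx] += value;
        }

        synchronized long sumSince(long sinceEpochSecond) {
            long total = 0;
            for (int i = 0; i < sums.length; i++) {
                if (slotSecond[i] >= sinceEpochSecond) {
                    total += sums[i];
                }
            }
            return total;
        }
    }

    private static final int SLOTS = 301; // buffer size quoted in the issue

    // One buffer per component ID. Nothing removes entries unless purgeComponent is called.
    private final Map<String, SecondRingBuffer> buffers = new ConcurrentHashMap<>();

    void recordEvent(String componentId, long epochSecond, long value) {
        buffers.computeIfAbsent(componentId, id -> new SecondRingBuffer(SLOTS))
               .add(epochSecond, value);
    }

    // Visits every tracked component, so the cost grows with the size of the map.
    Map<String, Long> report(long sinceEpochSecond) {
        Map<String, Long> result = new HashMap<>();
        buffers.forEach((id, buffer) -> result.put(id, buffer.sumSince(sinceEpochSecond)));
        return result;
    }

    // Sketch of the proposed remedy: drop the entry when the component leaves the flow.
    void purgeComponent(String componentId) {
        buffers.remove(componentId);
    }

    int trackedComponents() {
        return buffers.size();
    }

    public static void main(String[] args) {
        EventRepositorySketch repo = new EventRepositorySketch();
        long now = 1_000L;
        // Simulate 1000 short-lived components that each report one event.
        for (int i = 0; i < 1_000; i++) {
            repo.recordEvent("component-" + i, now, 1L);
        }
        // Without purging, every component is still tracked after it is gone.
        System.out.println("tracked without purge: " + repo.trackedComponents());
        for (int i = 0; i < 1_000; i++) {
            repo.purgeComponent("component-" + i);
        }
        System.out.println("tracked after purge: " + repo.trackedComponents());
    }
}
```

In this sketch the per-report cost is proportional to the number of map entries times the buffer size, which is why removing entries for deleted components bounds both memory use and report latency. How the actual pull request wires the purge into component removal is not shown here.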



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)