Posted to issues@nifi.apache.org by "Mark Payne (JIRA)" <ji...@apache.org> on 2018/08/29 15:38:00 UTC

[jira] [Updated] (NIFI-5533) Improve efficiency of FlowFiles' heap usage

     [ https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Payne updated NIFI-5533:
-----------------------------
    Fix Version/s: 1.8.0
           Status: Patch Available  (was: Open)

> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
>                 Key: NIFI-5533
>                 URL: https://issues.apache.org/jira/browse/NIFI-5533
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Looking at the code, I see several places where we can reduce the heap that NiFi uses for FlowFiles:
>  * When StandardPreparedQuery is used (any time Expression Language is evaluated), it creates a StringBuilder and iterates over all Expressions, evaluating them and concatenating the results together. If there is only a single Expression, though, we can avoid this and just return the value obtained from the Expression. While this will reduce the amount of garbage created, it plays a more important role: it avoids creating a new String object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 million copies of that String on the heap. If we just returned the result of evaluating the Expression, we would instead have 1 copy of that String.
>  * Similar to the above, it may make sense in UpdateAttribute to cache up to N entries, so that when an expression like ${filename}.txt is evaluated, even though a new String is generated by StandardPreparedQuery, we can resolve that to the same String object when storing it as a FlowFile attribute. This would work similarly to {{String.intern()}} but would not use {{String.intern()}}, because we don't want to store an unbounded number of these values in the {{String.intern()}} cache - we want to cap the number of entries, in case the values aren't always reused.
>  * Every FlowFile that is created by StandardProcessSession has a 'filename' attribute added. The value is obtained by calling {{String.valueOf(System.nanoTime())}}. This comes with a few downsides. First, the system call is somewhat expensive (though not bad). Second, the filename is not reliably unique: with many dataflows and concurrent tasks running, it is common for several FlowFiles to have colliding names. Most of all, though, it means that we are keeping that String on the heap. A simple test showed that using the UUID as the default filename instead allowed 20% more FlowFiles to be generated on the same heap before running out of memory.
>  * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap on every call. If we instead made one copy when the StandardProcessContext is created, we could just return that one Map every time, since it can't change over the lifetime of the ProcessContext. This is more about garbage collection and general processor performance than about heap utilization, but it is still in the same realm.
> I am sure that there are many more cases like these, but these are certainly worth tackling.
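
The bounded alternative to {{String.intern()}} described in the second bullet could look roughly like the sketch below. It is not taken from the NiFi patch: the class name BoundedStringCache and its methods canonicalize and entryCount are hypothetical, and the least-recently-used eviction policy is an assumption, since the issue only says that the number of entries should be capped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of NiFi): a size-capped stand-in for String.intern().
 * Equal Strings resolve to one shared instance while they remain in the cache.
 */
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used,
        // so the "eldest" entry is the least recently used one (assumed eviction policy).
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                // Evict once the cap is exceeded, so the cache never grows unbounded.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached instance equal to value; on a miss, caches and returns value itself. */
    synchronized String canonicalize(final String value) {
        if (value == null) {
            return null;
        }
        final String existing = cache.get(value);
        if (existing != null) {
            return existing;
        }
        cache.put(value, value);
        return value;
    }

    /** Number of entries currently held; never more than the configured cap. */
    synchronized int entryCount() {
        return cache.size();
    }
}
```

Under these assumptions, a processor would pass each evaluated attribute value through canonicalize before storing it on the FlowFile, so that repeated values share one String object while rarely reused values simply age out of the cache.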



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)