Posted to issues@nifi.apache.org by "Mark Payne (JIRA)" <ji...@apache.org> on 2018/08/17 16:29:00 UTC

[jira] [Commented] (NIFI-5533) Improve efficiency of FlowFiles' heap usage

    [ https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584139#comment-16584139 ] 

Mark Payne commented on NIFI-5533:
----------------------------------

Additionally, when we commit a session, we serialize the update to the FlowFileRepository into a ByteArrayOutputStream, then write it to disk in a single call to {{FileOutputStream.write(byte[])}}. This works well for small updates, but a very large update, such as one produced by a long Run Duration or a Split/Merge case, can consume a lot of heap. It would make sense, instead, for a session commit with a large number of updates (say, 10,000, or even 1,000) to write them to a separate file in the FlowFile Repo's directory, then write to the FlowFile Repo an "external reference" record containing the name of that file. To preserve the integrity of the FlowFile Repo, we must perform an fsync() on that file before updating the main 'journal'. We would also need to ensure that the file is deleted when we perform a checkpoint.
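
A minimal Java sketch of that ordering, not NiFi's actual repository code: the class name, the "overflow-" file naming, and the "EXTERNAL <name>" record format are all hypothetical, chosen only for illustration. It streams the serialized updates to a side file, forces that file to disk, and only then appends the reference to the journal.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical helper; names and on-disk formats are assumptions, not NiFi APIs.
class ExternalUpdateSketch {

    // Writes each serialized update to a new side file and fsyncs it.
    // Taking an Iterable lets a caller produce records one at a time
    // instead of buffering the whole commit on the heap.
    static Path writeExternal(Path repoDir, Iterable<byte[]> serializedUpdates) throws IOException {
        Path external = repoDir.resolve("overflow-" + UUID.randomUUID() + ".bin");
        try (FileChannel channel = FileChannel.open(external,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            for (byte[] update : serializedUpdates) {
                writeFully(channel, ByteBuffer.wrap(update));
            }
            // fsync the side file BEFORE the journal is allowed to reference it.
            channel.force(true);
        }
        return external;
    }

    // Appends an "external reference" record naming the side file to the journal.
    static void appendReference(Path journal, Path external) throws IOException {
        byte[] record = ("EXTERNAL " + external.getFileName() + "\n")
                .getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(journal,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            writeFully(channel, ByteBuffer.wrap(record));
            channel.force(true);
        }
    }

    // Called at checkpoint time: the side files are no longer needed.
    static void deleteExternalFiles(Path repoDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(repoDir, "overflow-*.bin")) {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
        }
    }

    // FileChannel.write may write fewer bytes than requested, so loop until done.
    private static void writeFully(FileChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }
}
```

The key design point is the order of the two force() calls: if the journal record reached disk before the side file's contents, a crash in between would leave the journal pointing at a missing or partial file.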

> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
>                 Key: NIFI-5533
>                 URL: https://issues.apache.org/jira/browse/NIFI-5533
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>
> Looking at the code, I see several places where we can reduce the heap that NiFi uses for FlowFiles:
>  * When StandardPreparedQuery is used (any time Expression Language is evaluated), it creates a StringBuilder and iterates over all Expressions, evaluating them and concatenating the results. If there is only a single Expression, though, we can skip this and just return the value obtained from that Expression. While this reduces the amount of garbage created, it plays a more important role: it avoids creating a new String object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 million copies of that String on the heap. If we just returned the result of evaluating the Expression, we would instead have 1 copy of that String.
>  * Similar to the above, it may make sense in UpdateAttribute to cache up to N entries, so that when an expression like ${filename}.txt is evaluated, even though a new String is generated by StandardPreparedQuery, we can resolve it to the same String object when storing it as a FlowFile attribute. This would work similarly to {{String.intern()}} but would not use {{String.intern()}} itself, because we don't want to store an unbounded number of these values in the {{String.intern()}} cache. We want to cap the number of entries, in case the values aren't always reused.
>  * Every FlowFile that is created by StandardProcessSession has a 'filename' attribute added. The value is obtained by calling {{String.valueOf(System.nanoTime())}}. This comes with a few downsides. Firstly, the system call is a bit expensive (though not bad). Secondly, the filename is not reliably unique: with many dataflows and concurrent tasks running, it's common for several FlowFiles to have 'naming collisions'. Most of all, though, it means that we are keeping that String on the heap. A simple test shows that using the UUID as the default filename instead allowed 20% more FlowFiles to be generated before the same heap was exhausted.
>  * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap for every call. If we instead created a copy of it once when the StandardProcessContext was created, we could instead just return that one Map every time, since it can't change over the lifetime of the ProcessContext. This is more about garbage collection and general processor performance than about heap utilization but still in the same realm.
> I am sure that there are far more of these nuances but these are certainly worth tackling.
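
The bounded cache suggested in the second bullet of the quoted description could look roughly like the following Java sketch. This is an illustration under assumptions, not code from NiFi: the class name BoundedStringCache and its methods are hypothetical, and LRU eviction via an access-ordered LinkedHashMap is just one possible policy for capping the number of entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class: a capped alternative to String.intern().
class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the entry that was reused longest ago.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns a previously seen String equal to 'value' if one is cached;
    // otherwise caches 'value' and returns it.
    synchronized String canonicalize(String value) {
        if (value == null) {
            return null;
        }
        String existing = cache.get(value);
        if (existing != null) {
            return existing;
        }
        cache.put(value, value);
        return value;
    }

    synchronized int entryCount() {
        return cache.size();
    }
}
```

A caller would pass each freshly evaluated attribute value through canonicalize() before storing it, so that equal values share one String object for as long as they stay in the cache. Values that are never reused simply age out instead of accumulating.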



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)