Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/11/28 18:21:00 UTC

[jira] [Updated] (NIFI-10887) Improve Performance of ReplaceText processor

     [ https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Payne updated NIFI-10887:
------------------------------
    Labels: performance  (was: )

> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>
> When performing some tests with the ReplaceText processor, I found that it seemed to be quite a bit slower than I expected, especially when using a Replacement Strategy of "Literal Replace" and when using a lot of small FlowFiles.
> As a result, I performed some profiling and identified a few areas that could use some improvement:
>  * When using the Literal Replace strategy, we find matches using {{Pattern.compile(Pattern.quote(...))}} and then calling {{Pattern.matcher(...).find()}}. This is quite inefficient compared to just using {{String.indexOf(...)}} and accounted for approximately 30% of the time spent in the processor.
>  * A significant amount of time was spent flushing the write buffer, as it flushes to disk when finished writing to each individual FlowFile. Even when we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets delegated all the way down to the FileOutputStream. However, when using ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We should do this when calling ProcessSession.write() as well. While it makes sense to flush data from the Processor layer's buffer, there's no need to flush past the session layer until the session is committed.
>  * A decent bit of time was spent in the session's get() method calling {{final Set<FlowFileRecord> set = unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's hashCode() method, which is the JVM default. We can easily implement hashCode() to just return the hashCode of the identifier, which is a String. A String caches its hash code after computing it once, so later calls cost essentially nothing beyond the method call itself, which eliminates the expense here.
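
For the first point (literal matching via String.indexOf rather than a quoted regex), a minimal sketch could look like the following. The class and method names are hypothetical and are not taken from the ReplaceText source; the sketch only illustrates the technique on an in-memory String.

```java
final class LiteralReplaceSketch {

    // Hypothetical helper: replaces every occurrence of the literal 'search'
    // in 'text' using String.indexOf, so no regex is compiled or matched.
    static String replaceAll(String text, String search, String replacement) {
        if (search.isEmpty()) {
            return text; // nothing to search for; leave the input untouched
        }
        int matchIndex = text.indexOf(search);
        if (matchIndex < 0) {
            return text; // no match, so no copy is needed
        }
        StringBuilder result = new StringBuilder(text.length());
        int copyFrom = 0;
        while (matchIndex >= 0) {
            // Copy the text between the previous match and this one, then the replacement.
            result.append(text, copyFrom, matchIndex).append(replacement);
            copyFrom = matchIndex + search.length();
            matchIndex = text.indexOf(search, copyFrom);
        }
        result.append(text, copyFrom, text.length()); // trailing text after the last match
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("a.b.c", ".", "-")); // prints a-b-c
    }
}
```

Because the search value is treated as plain characters, regex metacharacters such as "." need no quoting here.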
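
For the second point (not propagating flush() below the session layer), the idea can be sketched with a small wrapper stream. FlushSuppressingOutputStream and FlushCountingOutputStream are hypothetical names used only for this illustration; they are not NiFi's NonFlushableOutputStream, and the explicit flush at the end merely stands in for whatever happens at session commit.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FlushSuppressionSketch {

    // Hypothetical wrapper: passes writes through but ignores flush().
    static final class FlushSuppressingOutputStream extends FilterOutputStream {
        FlushSuppressingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time default
        }

        @Override
        public void flush() {
            // Intentionally a no-op: the owner of the underlying stream decides when to flush.
        }
    }

    // Hypothetical stand-in for the underlying stream; counts flushes that reach it.
    static final class FlushCountingOutputStream extends ByteArrayOutputStream {
        int flushCount = 0;

        @Override
        public void flush() {
            flushCount++;
        }
    }

    // Writes 'flowFiles' small payloads, flushing after each one through the wrapper,
    // then flushes the underlying stream once. Returns the flushes the underlying stream saw.
    static int countUnderlyingFlushes(int flowFiles) {
        FlushCountingOutputStream underlying = new FlushCountingOutputStream();
        OutputStream sessionView = new FlushSuppressingOutputStream(underlying);
        try {
            for (int i = 0; i < flowFiles; i++) {
                sessionView.write(("flowfile-" + i + "\n").getBytes(StandardCharsets.UTF_8));
                sessionView.flush(); // swallowed by the wrapper
            }
            underlying.flush(); // single real flush, standing in for session commit
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return underlying.flushCount;
    }

    public static void main(String[] args) {
        System.out.println(countUnderlyingFlushes(1000)); // prints 1
    }
}
```

However many per-FlowFile flushes the caller issues, only the one explicit flush reaches the underlying stream in this sketch.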
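
For the third point (deriving hashCode() from the String identifier), a rough sketch follows. IdentifiedQueue is a hypothetical stand-in for the queue class, and the map uses plain String values instead of FlowFileRecord; only the hashCode() delegation and the computeIfAbsent call shape mirror the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class QueueHashCodeSketch {

    // Hypothetical stand-in for a queue that has a stable String identifier.
    static final class IdentifiedQueue {
        private final String identifier;

        IdentifiedQueue(String identifier) {
            this.identifier = identifier;
        }

        @Override
        public int hashCode() {
            // java.lang.String caches its hash after the first computation,
            // so repeated map lookups do not recompute it.
            return identifier.hashCode();
        }
        // equals() is left as the identity default in this sketch; the same object
        // always yields the same hash, so the equals/hashCode contract still holds.
    }

    // Mirrors the computeIfAbsent pattern from the description, with String
    // standing in for FlowFileRecord. Returns the size of the queue's set.
    static int track(Map<IdentifiedQueue, Set<String>> unacknowledged,
                     IdentifiedQueue queue, String flowFileId) {
        Set<String> set = unacknowledged.computeIfAbsent(queue, k -> new HashSet<>());
        set.add(flowFileId);
        return set.size();
    }

    public static void main(String[] args) {
        Map<IdentifiedQueue, Set<String>> unacknowledged = new HashMap<>();
        IdentifiedQueue queue = new IdentifiedQueue("queue-1");
        track(unacknowledged, queue, "a");
        System.out.println(track(unacknowledged, queue, "b")); // prints 2
    }
}
```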



--
This message was sent by Atlassian Jira
(v8.20.10#820010)