Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/11/28 18:46:00 UTC

[jira] [Commented] (NIFI-10887) Improve Performance of ReplaceText processor

    [ https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640190#comment-17640190 ] 

Mark Payne commented on NIFI-10887:
-----------------------------------

Benchmarks show significant improvements in performance. Screenshots of performance before and after the changes are attached for both Regex Replace and Literal Replace. In both cases, I used 4 concurrent tasks and a 25 ms Run Duration.

When using Regex Replace, performance was slightly better, as expected.

When using Literal Replace, the improvement was more pronounced.

> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>         Attachments: ReplaceText-LiteralReplace-AfterChanges.png, ReplaceText-LiteralReplace-BeforeChanges.png, ReplaceText-RegexReplace-AfterChanges.png, ReplaceText-RegexReplace-BeforeChanges.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it seemed to be quite a bit slower than I expected, especially when using a Replacement Strategy of "Literal Replace" and when using a lot of small FlowFiles.
> As a result, I performed some profiling and identified a few areas that could use some improvement:
>  * When using the Literal Replace strategy, we find matches using {{Pattern.compile(Pattern.quote(...));}} and then using {{Pattern.matcher(...).find()}}. This is quite inefficient compared to just using {{String.indexOf(...)}} and accounted for approximately 30% of the time spent in the processor.
>  * A significant amount of time was spent flushing the write buffer, as it flushes to disk when finished writing to each individual FlowFile. Even when we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets delegated all the way down to the FileOutputStream. However, when using ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We should do this when calling ProcessSession.write() as well. While it makes sense to flush data from the Processor layer's buffer, there's no need to flush past the session layer until the session is committed.
>  * A decent bit of time was spent in the session's get() method calling {{final Set<FlowFileRecord> set = unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's hashCode() method, which is the JVM default. We can easily implement hashCode() to just return the hashCode of the identifier, which is a String. A String caches its hash code after the first computation, so the lookup becomes effectively free (with the exception of the method call itself), which eliminates the expense here.
>  * When using a Run Duration > 0 ms, we can hold InputStreams open by processing multiple FlowFiles in a given Session. This can also significantly improve performance. As such, we should make the default run duration 25 ms instead of 0 ms.
>  * A common pattern with ReplaceText is to prepend text to the beginning of a FlowFile or line, and then use another ReplaceText to append text to the end of the FlowFile or line. We should have a "Surround" strategy that allows us to both prepend and append text in a single processor. This should roughly double the performance for this use case.
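
As a rough illustration of the first bullet above, the sketch below contrasts the two ways of finding literal matches: compiling a quoted Pattern and calling find(), versus scanning with String.indexOf(). The class and method names (LiteralReplaceSketch, replaceWithPattern, replaceWithIndexOf) are hypothetical and are not taken from the NiFi code base; the actual processor works on FlowFile content rather than plain Strings, so treat this only as a minimal sketch of the technique.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual ReplaceText implementation.
public class LiteralReplaceSketch {

    // Regex-based literal search: quote the search term, then find() each match.
    public static String replaceWithPattern(String text, String search, String replacement) {
        Pattern pattern = Pattern.compile(Pattern.quote(search));
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one, then the replacement.
            out.append(text, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(text, last, text.length()).toString();
    }

    // indexOf-based literal search: no regex engine involved.
    public static String replaceWithIndexOf(String text, String search, String replacement) {
        if (search.isEmpty()) {
            // Assumption for this sketch: an empty search term leaves the text unchanged.
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        int idx;
        while ((idx = text.indexOf(search, last)) >= 0) {
            out.append(text, last, idx).append(replacement);
            last = idx + search.length();
        }
        return out.append(text, last, text.length()).toString();
    }

    public static void main(String[] args) {
        String input = "a.b.a";
        // The "." is treated literally by both approaches.
        System.out.println(replaceWithPattern(input, ".", "-"));
        System.out.println(replaceWithIndexOf(input, ".", "-"));
    }
}
```

Both methods produce the same result for a non-empty search term; the indexOf variant simply avoids compiling a Pattern and running the regex matcher for what is a plain substring search.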



--
This message was sent by Atlassian Jira
(v8.20.10#820010)