Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/11/28 18:45:00 UTC
[jira] [Commented] (NIFI-10888) Improve performance of Record Readers when inferring schema of small FlowFiles

    [ https://issues.apache.org/jira/browse/NIFI-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640189#comment-17640189 ] 

Mark Payne commented on NIFI-10888:
-----------------------------------

Benchmarks show significant performance improvements. I have attached screenshots of performance before and after the changes for both Regex Replace and Literal Replace. In both cases, I used 4 concurrent tasks and a 25 ms Run Duration.

When using Regex Replace, performance was slightly better, as expected.

When using Literal Replace, the improvement was more pronounced.

> Improve performance of Record Readers when inferring schema of small FlowFiles
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-10888
>                 URL: https://issues.apache.org/jira/browse/NIFI-10888
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>         Attachments: ReplaceText-LiteralReplace-AfterChanges.png, ReplaceText-LiteralReplace-BeforeChanges.png, ReplaceText-RegexReplace-AfterChanges.png, ReplaceText-RegexReplace-BeforeChanges.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we infer the schema of a FlowFile, the Record Reader has to read all of the data in the FlowFile in order to infer the schema accurately. As a result, when we use a Record Reader, by default, we must parse the entire FlowFile, then seek back to the beginning of it, and parse the entire FlowFile again in order to return the records.
> It turns out that for smaller FlowFiles, the most expensive part of this cycle is actually seeking back to the beginning of the FlowFile (via {{InputStream.reset()}}). When {{InputStream.reset()}} is called, it closes the current InputStream and opens a new one, reading from the Content Repository again, causing a disk seek.
> Instead, if {{InputStream.mark()}} is called, we should use a BufferedInputStream under the hood, and if {{reset()}} is then called, we should call {{BufferedInputStream.reset()}} if the number of bytes consumed since mark is less than or equal to the read limit. We should then use {{InputStream.mark(1024 * 1024)}}.
> Effectively, we should buffer up to 1 MB worth of content when inferring a schema. As a result, we can avoid that extra disk seek. For FlowFiles larger than 1 MB, this will not make a difference in performance. However, for larger FlowFiles it is less of a concern, simply because we are performing the seek less frequently (e.g., if we have 10 FlowFiles of 50 MB each vs. 1000 FlowFiles of 5 KB each, we end up seeking 100x less frequently in the case of larger FlowFiles).
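
The mark/reset behavior described above could look roughly like the following. This is only a minimal sketch, not the actual NiFi change: the names StreamOpener and BufferedResetStream are hypothetical, the opener stands in for reading from the Content Repository, and the sketch assumes that mark() is called at the very beginning of the stream (so that falling back to re-opening the content is equivalent to a reset).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for "open the FlowFile content from the Content Repository". */
interface StreamOpener {
    InputStream open() throws IOException;
}

/** Sketch: serve reset() from an in-memory buffer when possible, re-open otherwise. */
class BufferedResetStream extends InputStream {
    private final StreamOpener opener;
    private InputStream current;
    private long bytesSinceMark = 0;
    private int readLimit = -1;   // -1 means no mark is active
    private int reopenCount = 0;  // how often the expensive path was taken

    BufferedResetStream(StreamOpener opener) throws IOException {
        this.opener = opener;
        this.current = opener.open();
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (b >= 0) {
            bytesSinceMark++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = current.read(buf, off, len);
        if (n > 0) {
            bytesSinceMark += n;
        }
        return n;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int limit) {
        // Only start buffering once a caller actually asks for mark/reset.
        if (!(current instanceof BufferedInputStream)) {
            current = new BufferedInputStream(current);
        }
        current.mark(limit);
        readLimit = limit;
        bytesSinceMark = 0;
    }

    @Override
    public synchronized void reset() throws IOException {
        if (readLimit >= 0 && bytesSinceMark <= readLimit) {
            // Cheap path: rewind within the buffer, no new read from the source.
            current.reset();
        } else {
            // Expensive path: close and re-open the content (assumes mark was at offset 0).
            current.close();
            current = opener.open();
            readLimit = -1;
            reopenCount++;
        }
        bytesSinceMark = 0;
    }

    @Override
    public void close() throws IOException {
        current.close();
    }

    int getReopenCount() {
        return reopenCount;
    }
}

class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "name,age\nalice,30\n".getBytes(StandardCharsets.UTF_8);
        BufferedResetStream in = new BufferedResetStream(() -> new ByteArrayInputStream(data));
        in.mark(1024 * 1024);             // buffer up to 1 MB, as proposed above
        while (in.read() >= 0) { }        // first pass: e.g. schema inference
        in.reset();                       // served from the buffer
        while (in.read() >= 0) { }        // second pass: e.g. reading the records
        System.out.println("re-opens: " + in.getReopenCount());
        in.close();
    }
}
```

With content smaller than the read limit, the second pass is served from memory and the opener is invoked only once. If more bytes than the read limit were consumed before reset(), the sketch falls back to closing and re-opening the content, which matches the "no difference for FlowFiles larger than 1 MB" case described above.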



--
This message was sent by Atlassian Jira
(v8.20.10#820010)