Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/11/28 18:32:00 UTC

[jira] [Created] (NIFI-10888) Improve performance of Record Readers when inferring schema of small FlowFiles

Mark Payne created NIFI-10888:
---------------------------------

             Summary: Improve performance of Record Readers when inferring schema of small FlowFiles
                 Key: NIFI-10888
                 URL: https://issues.apache.org/jira/browse/NIFI-10888
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
            Reporter: Mark Payne
            Assignee: Mark Payne


When we infer the schema of a FlowFile, the Record Reader has to read all of the data in the FlowFile in order to infer the schema accurately. As a result, when we use a Record Reader, by default, we must parse the entire FlowFile, then seek back to the beginning of it, and parse the entire FlowFile again in order to return the records.

It turns out that for smaller FlowFiles, the most expensive part of this cycle is actually seeking back to the beginning of the FlowFile (via {{InputStream.reset()}}). When {{InputStream.reset()}} is called, it closes the current InputStream and opens a new one, reading from the Content Repository again, causing a disk seek.

Instead, if {{InputStream.mark()}} is called, we should use a BufferedInputStream under the hood, and if {{reset()}} is then called, we should call {{BufferedInputStream.reset()}} if the number of bytes consumed since mark is less than or equal to the read limit. We should then use {{InputStream.mark(1024 * 1024)}}.
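
A minimal sketch of that idea follows. It is not NiFi's actual implementation: {{ContentSource}}, {{ResettableStream}} and {{reopenCount}} are hypothetical names used only for illustration, and {{ContentSource.open()}} stands in for "open the FlowFile content from the Content Repository again". The stream resets in memory when the bytes consumed since {{mark()}} fit within the read limit, and only reopens the source otherwise.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for re-reading content from the Content Repository. */
interface ContentSource {
    InputStream open() throws IOException;
}

/** Sketch: reset() from the in-memory buffer when possible, else reopen the source. */
class ResettableStream extends InputStream {
    private final ContentSource source;
    private BufferedInputStream in;
    private long position = 0;      // bytes consumed from the start of the content
    private long markPosition = 0;  // value of 'position' when mark() was called
    private int readLimit = 0;      // limit passed to mark()
    int reopenCount = 0;            // illustration only: counts the expensive path

    ResettableStream(ContentSource source) throws IOException {
        this.source = source;
        this.in = new BufferedInputStream(source.open());
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) position++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int limit) {
        readLimit = limit;
        markPosition = position;
        markBuffer();
    }

    @Override
    public synchronized void reset() throws IOException {
        if (position - markPosition <= readLimit) {
            in.reset();  // cheap path: served from the buffer, no new stream
        } else {
            // Expensive path: reopen the content and skip forward to the mark.
            in.close();
            in = new BufferedInputStream(source.open());
            reopenCount++;
            long toSkip = markPosition;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) throw new IOException("Could not skip to mark");
                toSkip -= skipped;
            }
            markBuffer();  // re-establish the mark for any later reset()
        }
        position = markPosition;
    }

    // One byte of headroom, so that a read attempt at end-of-stream after
    // exactly readLimit bytes does not make the buffer drop its mark.
    private void markBuffer() {
        in.mark(readLimit < Integer.MAX_VALUE ? readLimit + 1 : readLimit);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With a wrapper like this, a reader would call {{mark(1024 * 1024)}} before inferring the schema and {{reset()}} afterwards. Content that fits in the limit never touches the source a second time, while larger content falls back to the existing reopen behavior.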

Effectively, we should buffer up to 1 MB worth of content when inferring a schema. As a result, we can avoid that extra disk seek. For FlowFiles larger than 1 MB, this will not make a difference in performance. However, for larger FlowFiles it is less of a concern, simply because we are performing the seek less frequently (e.g., if we have 10 FlowFiles of 50 MB each vs. 1000 FlowFiles of 5 KB each, we end up seeking 100x less frequently in the case of the larger FlowFiles).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)