Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/12/14 17:11:00 UTC

[jira] [Commented] (NIFI-10888) Improve performance of Record Readers when inferring schema of small FlowFiles

    [ https://issues.apache.org/jira/browse/NIFI-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647619#comment-17647619 ] 

ASF subversion and git services commented on NIFI-10888:
--------------------------------------------------------

Commit 78be613a0f85b664695ea2cbfaf26163f9b8e454 in nifi's branch refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=78be613a0f ]

NIFI-10888: When inferring a schema using a Record Reader, buffer up to 1 MB of FlowFile content for the schema inference so that when we read the contents to obtain records we can use the buffered data. This helps in cases of small FlowFiles by not having to seek back to the beginning of the FlowFile every time.

Signed-off-by: Matthew Burgess <ma...@apache.org>

This closes #6725


> Improve performance of Record Readers when inferring schema of small FlowFiles
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-10888
>                 URL: https://issues.apache.org/jira/browse/NIFI-10888
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>         Attachments: InferSchema-AfterChanges.png, InferSchema-BeforeChanges.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When we infer the schema of a FlowFile, the Record Reader has to read all of the data in the FlowFile in order to infer the schema accurately. As a result, when we use a Record Reader, by default, we must parse the entire FlowFile, then seek back to the beginning of it, and parse the entire FlowFile again in order to return the records.
> It turns out that for smaller FlowFiles, the most expensive part of this cycle is actually seeking back to the beginning of the FlowFile (via {{InputStream.reset()}}). When {{InputStream.reset()}} is called, it closes the current InputStream and opens a new one, reading from the Content Repository again, causing a disk seek.
> Instead, if {{InputStream.mark()}} is called, we should use a BufferedInputStream under the hood, and if {{reset()}} is then called, we should call {{BufferedInputStream.reset()}} if the number of bytes consumed since mark is less than or equal to the read limit. We should then use {{InputStream.mark(1024 * 1024)}}.
> Effectively, we should buffer up to 1 MB worth of content when inferring a schema. As a result, we can avoid that extra disk seek. For FlowFiles larger than 1 MB, this will not make a difference in performance. However, for larger FlowFiles the seek is less of a concern, simply because we perform it less frequently (e.g., if we have 10 FlowFiles of 50 MB each vs. 1000 FlowFiles of 5 KB each, we end up seeking 100x less frequently in the case of the larger FlowFiles).
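
The mark/reset approach described above can be sketched with plain java.io classes. This is only an illustration, not the code from the commit: the class name, the openSource() helper (a stand-in for opening a stream from the Content Repository), and the open counter are all hypothetical. It reads the same content twice and reports how many times the underlying source had to be opened. Rather than comparing byte counts, the sketch simply attempts reset() and falls back to reopening the source if the mark is no longer valid.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BufferedInferenceSketch {
    // The 1 MB read limit mentioned in the issue.
    static final int READ_LIMIT = 1024 * 1024;

    // Hypothetical counter: how many times the content source was opened.
    static int opens = 0;

    // Hypothetical stand-in for opening a stream from the Content Repository.
    static InputStream openSource(byte[] content) {
        opens++;
        return new ByteArrayInputStream(content);
    }

    // Stand-in for a full pass over the data (schema inference or record reading).
    static long consumeAll(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    // Reads the content twice and returns how many times the source was opened.
    static int opensForTwoPasses(byte[] content, int readLimit) {
        try {
            opens = 0;
            InputStream in = new BufferedInputStream(openSource(content));
            in.mark(readLimit);
            consumeAll(in); // pass 1: stand-in for schema inference
            try {
                in.reset(); // replay from the in-memory buffer if the mark is still valid
            } catch (IOException markInvalidated) {
                // Too much was read past the mark: fall back to reopening the source.
                in.close();
                in = openSource(content);
            }
            consumeAll(in); // pass 2: stand-in for reading the records
            in.close();
            return opens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 5 KB fits within the read limit, so the source is opened once.
        System.out.println(opensForTwoPasses(new byte[5 * 1024], READ_LIMIT));
        // 2 MB exceeds the read limit, so the source is opened a second time.
        System.out.println(opensForTwoPasses(new byte[2 * 1024 * 1024], READ_LIMIT));
    }
}
```

In this sketch the small input is replayed from memory, while the large input takes the same reopen path it would have taken without buffering, which matches the expectation in the description that FlowFiles larger than 1 MB see no change.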



--
This message was sent by Atlassian Jira
(v8.20.10#820010)