Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2023/05/23 20:00:00 UTC

[jira] [Updated] (NIFI-11584) MergeContent can be more efficient in terms of disk access

     [ https://issues.apache.org/jira/browse/NIFI-11584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Payne updated NIFI-11584:
------------------------------
    Status: Patch Available  (was: Open)

> MergeContent can be more efficient in terms of disk access
> ----------------------------------------------------------
>
>                 Key: NIFI-11584
>                 URL: https://issues.apache.org/jira/browse/NIFI-11584
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework, Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.latest, 2.latest
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Long ago (NIFI-516), we updated MergeContent so that when it read from a FlowFile, it asked the ProcessSession not to manage the InputStream and instead closed the InputStream as soon as it finished reading. This was done because if we had, say, 50,000 FlowFiles to merge together, we'd have 50,000 ProcessSessions. Since the session by default holds the InputStream open until the session is committed or rolled back, we would hold open 50,000 FileInputStreams. This would quickly lead to IOExceptions due to "too many open files". So in NIFI-516, we addressed the issue by not holding the stream open.
> Then, in NIFI-2850, we made things much more efficient by allowing FlowFiles to be moved from one ProcessSession to another. So now, instead of using 50,000 ProcessSessions, we have a single ProcessSession for the whole bin.
> However, we did not change the behavior of asking the ProcessSession not to hold the stream open. We can now allow the ProcessSession to manage the InputStream as it does elsewhere.
> Additionally, looking at the codebase, MergeContent is the only component that uses this feature of the ProcessSession. Using it is bad practice, because the ProcessSession.migrate capability makes it unnecessary to ever do this. As a result, we should deprecate the {{void read(FlowFile source, boolean allowSessionStreamManagement, InputStreamCallback reader) throws FlowFileAccessException}} method in 1.x and remove it in 2.0.
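
The open-file reasoning in the description can be illustrated with a small, self-contained Java sketch. This is a toy model only: ToySession, peakOpenStreams and the counters are hypothetical names invented for illustration and are not part of the NiFi API. The sketch assumes that a session retains at most one managed stream at a time and releases it on the next read or on commit. That assumption is made purely so the three scenarios can be compared; it is not a statement about how NiFi is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the open-stream count; not NiFi code. */
final class StreamManagementSketch {

    // Simulated count of open streams, standing in for open file handles.
    static int open = 0;
    static int peakOpen = 0;

    /** Hypothetical stand-in for a session; NOT NiFi's ProcessSession. */
    static final class ToySession {
        // Assumption of this sketch: at most one managed stream is retained.
        private InputStream held;

        void read(byte[] content, boolean allowSessionStreamManagement) {
            releaseHeld(); // a new read replaces any stream retained earlier
            InputStream in = new ByteArrayInputStream(content);
            open++;
            peakOpen = Math.max(peakOpen, open);
            try {
                while (in.read() != -1) {
                    // consume the content
                }
                if (allowSessionStreamManagement) {
                    held = in;   // stays open until the next read or commit
                } else {
                    in.close();  // closed as soon as reading finishes
                    open--;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void commit() {
            releaseHeld();
        }

        private void releaseHeld() {
            if (held == null) {
                return;
            }
            try {
                held.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            held = null;
            open--;
        }
    }

    /** Returns the peak number of simultaneously open simulated streams. */
    static int peakOpenStreams(int flowFiles, boolean sessionPerFlowFile, boolean managed) {
        open = 0;
        peakOpen = 0;
        List<ToySession> sessions = new ArrayList<>();
        ToySession shared = new ToySession();
        if (!sessionPerFlowFile) {
            sessions.add(shared);
        }
        for (int i = 0; i < flowFiles; i++) {
            ToySession session = shared;
            if (sessionPerFlowFile) {
                session = new ToySession();
                sessions.add(session);
            }
            session.read(new byte[] {1, 2, 3}, managed);
        }
        // Nothing is committed until the whole bin has been read.
        for (ToySession session : sessions) {
            session.commit();
        }
        return peakOpen;
    }

    public static void main(String[] args) {
        System.out.println("session per FlowFile, managed:   " + peakOpenStreams(1000, true, true));
        System.out.println("session per FlowFile, unmanaged: " + peakOpenStreams(1000, true, false));
        System.out.println("single session, managed:         " + peakOpenStreams(1000, false, true));
    }
}
```

In this model, only the first scenario (one session per FlowFile with session-managed streams) lets the number of open streams grow with the size of the bin. Both the NIFI-516 workaround and a single session that manages its own stream keep the peak constant, which is the argument for dropping the workaround once the bin uses a single session.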



--
This message was sent by Atlassian Jira
(v8.20.10#820010)