Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2023/05/18 15:52:00 UTC
[jira] [Commented] (NIFI-11557) Eliminate use of Files.walkFileTree for any performance-critical parts of application

    [ https://issues.apache.org/jira/browse/NIFI-11557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723950#comment-17723950 ] 

Mark Payne commented on NIFI-11557:
-----------------------------------

Looking further into this, I found that the logic that we have currently that scans through the content repo serves two purposes:
1. To count how many files are archived
2. To determine the timestamp of the oldest archived file.

The timestamp of the oldest archived file was meant as a performance optimization: if even the oldest archived file is not yet old enough to be cleaned up under the time constraint, the background scan can be skipped.
Interestingly, this code was buggy. It checked the last modified time of each file and compared it to 'oldestTimestamp', but 'oldestTimestamp' was initialized to 0, so it always remained 0. As a result, this code was very expensive and served no purpose.
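
To illustrate the kind of bug described above, here is a minimal sketch. It is not the actual FileSystemRepository code; the class and method names are hypothetical, and it assumes the comparison was a simple "is this file older than the oldest seen so far" check.

```java
import java.util.List;

// Hypothetical reconstruction for illustration only; not NiFi source code.
final class OldestTimestampSketch {

    // Buggy pattern: starting at 0 means no real (positive) timestamp
    // is ever smaller, so the value is never updated.
    static long oldestBuggy(List<Long> lastModifiedTimes) {
        long oldestTimestamp = 0L;
        for (long lastModified : lastModifiedTimes) {
            if (lastModified < oldestTimestamp) {
                oldestTimestamp = lastModified;
            }
        }
        return oldestTimestamp;
    }

    // One possible fix: start at Long.MAX_VALUE so the first timestamp wins.
    // An empty list leaves the result at Long.MAX_VALUE.
    static long oldestFixed(List<Long> lastModifiedTimes) {
        long oldestTimestamp = Long.MAX_VALUE;
        for (long lastModified : lastModifiedTimes) {
            if (lastModified < oldestTimestamp) {
                oldestTimestamp = lastModified;
            }
        }
        return oldestTimestamp;
    }
}
```

With positive timestamps, the first method returns 0 regardless of input, which matches the behavior described: the scan paid the cost of reading every file's last modified time without ever producing a usable value.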

We only really need to count the number of files archived. This can be achieved MUCH more efficiently by simply performing a {{File.listFiles}} call on each archive directory. This will drastically improve startup performance in cases where there are millions of files archived.
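
A rough sketch of what counting via {{File.listFiles}} could look like follows. It is not the actual patch. It assumes a layout in which each section directory under a container holds an "archive" subdirectory, and the class, method, and constant names are hypothetical.

```java
import java.io.File;

// Hypothetical sketch for illustration only; not the actual NiFi change.
final class ArchiveCountSketch {

    // Assumed name of the archive subdirectory inside each section directory.
    static final String ARCHIVE_DIR_NAME = "archive";

    // Counts the entries directly inside every <container>/<section>/archive directory.
    static long countArchivedFiles(File containerDir) {
        File[] sections = containerDir.listFiles();
        if (sections == null) {
            return 0L; // not a directory, or an I/O error occurred
        }

        long count = 0L;
        for (File section : sections) {
            // listFiles() returns null when the archive directory does not exist.
            File[] archived = new File(section, ARCHIVE_DIR_NAME).listFiles();
            if (archived != null) {
                count += archived.length;
            }
        }
        return count;
    }
}
```

Only the length of each returned array is used, so no timestamp or other attribute is read for any individual file.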

> Eliminate use of Files.walkFileTree for any performance-critical parts of application
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-11557
>                 URL: https://issues.apache.org/jira/browse/NIFI-11557
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework, Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.latest, 2.latest
>
>
> The FileSystemRepository (content repo implementation) as well as ListFile both make use of the {{Files.walkFileTree}} method. Recently, I worked with a user who had horribly long startup times. Thread dumps show that the time was almost entirely in the FileSystemRepository's {{initializeRepository}} method as it is walking the file tree in order to determine which archive files can be cleaned up next. This is done during startup and again periodically in background threads.
> I made a small modification locally to instead use the standard synchronous IO methods (the {{File.listFiles}} method). I used GenerateFlowFile to generate 1-byte FlowFiles and set {{nifi.content.claim.max.appendable.size=1 B}} in nifi.properties in order to generate a huge number of files - about 1.2 million files in the content repository - and restarted a few times. Additionally, I added some log lines to show how long this part of the startup process took.
> With the existing code, startup took 210 seconds (3.5 mins). With the new implementation, it took 6.7 seconds. This appears to be due to the fact that when using NIO.2, it does an individual disk access for every file to obtain file attributes, while when using the {{File.listFiles}} method the File objects that are returned already have the necessary attributes. As a result, the NIO.2 approach makes millions of disk accesses that are unnecessary. As the number of files in the repository grows, the discrepancy also grows.
> We need to eliminate any use of {{Files.walkFileTree}} for any performance-critical parts of the codebase.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)