Posted to issues@nifi.apache.org by "Lehel Boér (Jira)" <ji...@apache.org> on 2023/06/28 18:55:00 UTC

[jira] [Commented] (NIFI-11178) Improve memory efficiency of ListHDFS

    [ https://issues.apache.org/jira/browse/NIFI-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738247#comment-17738247 ] 

Lehel Boér commented on NIFI-11178:
-----------------------------------

Heap dump at the end of a run with the old ListHDFS while processing a 4-ary directory tree with 1000 zero-byte files per node (~350,000 files in total).

The HdfsNamedFileStatus objects retain ~241 MB.

 

!image-2023-06-28-20-48-57-012.png!

*Heap dump comparison between the old and new versions*

 

!image-2023-06-28-20-53-21-210.png!

 

*Old ListHDFS memory usage:*

!image-2023-06-28-20-54-00-887.png!

*New ListHDFS memory usage:*

!image-2023-06-28-20-54-32-188.png!


> Improve memory efficiency of ListHDFS
> -------------------------------------
>
>                 Key: NIFI-11178
>                 URL: https://issues.apache.org/jira/browse/NIFI-11178
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Lehel Boér
>            Priority: Major
>         Attachments: image-2023-06-28-20-48-40-182.png, image-2023-06-28-20-48-57-012.png, image-2023-06-28-20-53-21-210.png, image-2023-06-28-20-54-00-887.png, image-2023-06-28-20-54-32-188.png
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> ListHDFS is used extremely commonly. Typically, a listing consists of several hundred files or fewer. However, there are times (especially when performing the first listing) when the Processor is configured to recurse into subdirectories and produces a listing containing millions of files.
> Currently, performing a listing that contains millions of files can occupy several GB of heap space. Analyzing a recent heap dump, it was found that a listing of 6.7 million files in HDFS occupied approximately 12 GB of heap space in NiFi. The heap usage can be traced back to the fact that we hold a HashSet<FileStatus> in a local variable, and these FileStatus objects occupy approximately 1-2 KB of heap each.
> There are several improvements that can be made here; some small changes yield significant memory savings. There are also larger changes that could dramatically reduce heap utilization, to nearly nothing, but these would require complex rewrites of the Processor that would be much more difficult to maintain.
> As a simple analysis, I built a small test to see how many FileStatus objects could be kept in memory:
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.permission.FsPermission;
>
> final Set<FileStatus> statuses = new HashSet<>();
> for (int i = 0; i < 10_000_000; i++) {
>     if (i % 10_000 == 0) {
>         System.out.println(i); // progress marker: the last value printed before the crash approximates capacity
>     }
>     // Note: 0777 (octal) for rwxrwxrwx, not decimal 777
>     final FsPermission fsPermission = new FsPermission((short) 0777);
>     final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L, fsPermission,
>         "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"), true, false, false);
>     statuses.add(status);
> }
> {code}
> This gives us a way to see how many FileStatus objects can be added to our set before we encounter an OutOfMemoryError (OOME). With a 512 MB heap, I reached approximately 1.13 million FileStatus objects. Note that the Paths here are very small, so they occupy less memory than would normally be the case.
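> As a variation on the harness above (sketch only; the {{createStatus}} helper is hypothetical, standing in for the FileStatus construction shown earlier), the capacity can be measured explicitly by catching the OutOfMemoryError instead of reading the last printed counter. Run with a constrained heap, e.g. -Xmx512m:
> {code:java}
> final Set<FileStatus> statuses = new HashSet<>();
> int count = 0;
> try {
>     while (true) {
>         statuses.add(createStatus(count)); // hypothetical factory for the FileStatus built above
>         count++;
>     }
> } catch (final OutOfMemoryError e) {
>     statuses.clear(); // release the set so there is heap left to report the result
>     System.out.println("Held " + count + " FileStatus objects before OOME");
> }
> {code}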
> Making one small change, replacing the {{Set<FileStatus>}} with an {{ArrayList<FileStatus>}}, yielded 1.21 million objects instead of 1.13 million. This is a reasonable change, since we don't expect any duplicates anyway.
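> The swap itself is a one-liner; the savings come from dropping HashMap's per-entry Node (object header, cached hash, and key/value/next references, plus a share of the bucket array), which costs roughly 50 bytes per element on a 64-bit JVM, versus a single reference slot in an ArrayList:
> {code:java}
> // Sketch of the change: List semantics suffice because a single listing
> // never produces duplicate FileStatus entries.
> final List<FileStatus> statuses = new ArrayList<>();
> {code}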
> Another small change was to introduce a new class that keeps only the fields we care about from the FileStatus:
> {code:java}
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.permission.FsAction;
>
> // Keeps only the fields ListHDFS actually needs, rather than the full FileStatus.
> private static class HdfsEntity {
>     private final String path;
>     private final long length;
>     private final boolean isdir;
>     private final long timestamp;
>     private final boolean canRead;
>     private final boolean canWrite;
>     private final boolean canExecute;
>     private final boolean encrypted;
>
>     public HdfsEntity(final FileStatus status) {
>         this.path = status.getPath().getName();
>         this.length = status.getLen();
>         this.isdir = status.isDirectory();
>         this.timestamp = status.getModificationTime();
>         this.canRead = status.getPermission().getGroupAction().implies(FsAction.READ);
>         this.canWrite = status.getPermission().getGroupAction().implies(FsAction.WRITE);
>         this.canExecute = status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
>         this.encrypted = status.isEncrypted();
>     }
> }
> {code}
> This introduced significant savings, allowing a {{List<HdfsEntity>}} to store 5.2 million objects instead of 1.13 million.
> It is worth noting that HdfsEntity doesn't store the 'group' and 'owner' that are part of the FileStatus. Interestingly, these values are extremely repetitive, yet each one is a distinct String on the heap. A full solution would track the owner and group, but do so using a Map<String, String> or something similar in order to reuse identical Strings on the heap, in much the same way that String.intern() does, but without actually interning the String, since it is only reusable within a small context.
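> A minimal sketch of that idea (the names here are hypothetical, not from the NiFi codebase): keep a per-listing map and route every owner/group value through it, so each distinct string is allocated only once:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical per-listing string pool. Unlike String.intern(), the pool is
> // discarded when the listing completes, so nothing accumulates across runs.
> private final Map<String, String> stringPool = new HashMap<>();
>
> private String canonical(final String value) {
>     final String existing = stringPool.putIfAbsent(value, value);
>     return existing == null ? value : existing;
> }
>
> // Usage when building an HdfsEntity:
> //   this.owner = canonical(status.getOwner());
> //   this.group = canonical(status.getGroup());
> {code}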
> These changes will yield significant memory improvements.
> However, more invasive changes can provide even larger improvements:
>  * Rather than keeping a Set or a List of entities at all, we could instead iterate over each of the FileStatus objects returned from HDFS and determine whether or not the file should be included in the listing. If so, create the FlowFile or write the Record to the RecordWriter, eliminating the collection altogether. This would likely result in code similar to ListS3, which has an internal {{S3ObjectWriter}} class that provides a single interface regardless of whether a RecordWriter is being used (see the sketch after this list). This would provide very significant memory improvements; in the case of using a Record Writer, it may even reduce the footprint to nearly 0.
>  * However, when not using a Record Writer, we will still have the FlowFiles kept in memory until the session is committed. One way to deal with this would be to change the algorithm so that instead of performing a single listing of the directory and all sub-directories, we sort the sub-directories by name and then commit the session and update state per sub-directory. This would introduce significant risk, though, in ensuring that we neither duplicate messages upon restart nor lose data, so it is perhaps not the best option.
>  * Alternatively, we could document the concern when not using a Record Writer and provide an additionalDetails.html that shows the preferred way to use the Processor when expecting many files: ListHDFS would be connected to SplitRecord, which would split into chunks of, say, 10,000 Records, and then to a PartitionRecord that would create attributes for the fields. This gives the same result as outputting without the Record Writer, but in a way that is much more memory efficient.
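> A rough sketch of the writer abstraction from the first bullet (the interface and names are illustrative, modeled on ListS3's {{S3ObjectWriter}}, not actual NiFi code):
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.RemoteIterator;
>
> // Illustrative only: stream each FileStatus straight to an output strategy
> // instead of accumulating a collection in memory.
> interface ListingWriter {
>     void write(FileStatus status) throws IOException;
> }
>
> // One implementation would create a FlowFile per entry; another would append
> // a Record to the configured RecordWriter. The listing loop then becomes:
> void list(final RemoteIterator<FileStatus> statuses, final ListingWriter writer) throws IOException {
>     while (statuses.hasNext()) {
>         final FileStatus status = statuses.next();
>         if (shouldInclude(status)) { // hypothetical filter: timestamp / path checks
>             writer.write(status);    // nothing retained after this call
>         }
>     }
> }
> {code}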



--
This message was sent by Atlassian Jira
(v8.20.10#820010)