Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2022/04/13 06:28:07 UTC

[jira] [Updated] (FLINK-11868) [filesystems] Introduce listStatusIterator API to file system

     [ https://issues.apache.org/jira/browse/FLINK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Gao updated FLINK-11868:
----------------------------
    Fix Version/s: 1.16.0

> [filesystems] Introduce listStatusIterator API to file system
> -------------------------------------------------------------
>
>                 Key: FLINK-11868
>                 URL: https://issues.apache.org/jira/browse/FLINK-11868
>             Project: Flink
>          Issue Type: New Feature
>          Components: FileSystems
>            Reporter: Yun Tang
>            Assignee: Yun Tang
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-unassigned, stale-assigned
>             Fix For: 1.15.0, 1.16.0
>
>
> From experience, we know that {{listStatus}} is expensive on many distributed file systems, especially when the directory contains a large number of files. The method not only blocks the calling thread until the full result is returned, but can also cause an OOM because the returned array of {{FileStatus}} can be very large. We should already have learned this from FLINK-7266 and FLINK-8540.
> However, listing the file statuses under a path is really helpful in many situations. Thankfully, many distributed file systems have recognized this and provide an API such as {{[listStatusIterator|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatusIterator(org.apache.hadoop.fs.Path)]}} that queries the file system on demand.
>  
> We should also introduce this API and use it to replace the current implementations that rely on {{listStatus}}.
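
To illustrate the difference between returning a full array and listing on demand, here is a minimal, self-contained sketch of an iterator that fetches a directory listing one page at a time. It is not Flink's or Hadoop's implementation: the names Status, PagedStatusIterator, pageFetcher and slice are hypothetical, and the in-memory list only stands in for a remote file system. It assumes the backing store can return entries by offset and limit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Illustrative sketch only: every name here is hypothetical, not Flink or Hadoop API.
class ListStatusIteratorSketch {

    /** Minimal stand-in for a file status entry (path and length only). */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Walks a listing page by page, fetching the next page only when it is needed. */
    static final class PagedStatusIterator implements Iterator<Status> {
        // Assumed backend call: (offset, limit) -> at most `limit` entries starting at `offset`.
        private final BiFunction<Integer, Integer, List<Status>> pageFetcher;
        private final int pageSize;
        private List<Status> page = Collections.emptyList();
        private int posInPage = 0;
        private int offset = 0;
        private boolean exhausted = false;

        PagedStatusIterator(BiFunction<Integer, Integer, List<Status>> pageFetcher, int pageSize) {
            if (pageSize <= 0) {
                throw new IllegalArgumentException("pageSize must be positive");
            }
            this.pageFetcher = pageFetcher;
            this.pageSize = pageSize;
        }

        @Override
        public boolean hasNext() {
            // Fetch a new page only after the current one has been fully consumed.
            while (posInPage >= page.size() && !exhausted) {
                page = pageFetcher.apply(offset, pageSize);
                offset += page.size();
                posInPage = 0;
                // A short (or empty) page means the listing has ended.
                if (page.size() < pageSize) {
                    exhausted = true;
                }
            }
            return posInPage < page.size();
        }

        @Override
        public Status next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return page.get(posInPage++);
        }
    }

    /** Fake backend: returns one page out of an in-memory listing. */
    static List<Status> slice(List<Status> all, int offset, int limit) {
        int from = Math.min(offset, all.size());
        int to = Math.min(from + limit, all.size());
        return all.subList(from, to);
    }

    public static void main(String[] args) {
        List<Status> all = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            all.add(new Status("/dir/file-" + i, 100L * i));
        }
        int[] fetches = {0}; // counts calls to the fake backend
        Iterator<Status> it = new PagedStatusIterator((off, lim) -> {
            fetches[0]++;
            return slice(all, off, lim);
        }, 2);
        // Consume only the first three entries; the remaining pages are never requested.
        for (int i = 0; i < 3 && it.hasNext(); i++) {
            System.out.println(it.next().path);
        }
        System.out.println("pages fetched so far: " + fetches[0]);
    }
}
```

With this shape, the caller holds at most one page of entries in memory at a time, and a caller that stops early never pays for the rest of the listing. For comparison, Hadoop's listStatusIterator returns a RemoteIterator whose hasNext() and next() may throw IOException; the sketch above uses java.util.Iterator only to stay dependency-free.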



--
This message was sent by Atlassian Jira
(v8.20.1#820001)