Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/01/06 19:45:00 UTC

[jira] [Assigned] (HUDI-1479) Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

     [ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar reassigned HUDI-1479:
------------------------------------

    Assignee: Udit Mehrotra  (was: Vinoth Chandar)

> Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-1479
>                 URL: https://issues.apache.org/jira/browse/HUDI-1479
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Code Cleanup
>            Reporter: Vinoth Chandar
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>             Fix For: 0.7.0
>
>         Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata, boolean verifyListings,
>                                                   boolean assumeDatePartitioning) throws IOException {
>     if (assumeDatePartitioning) {
>       return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
>     } else {
>       HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata,
>           verifyListings, false, false);
>       return tableMetadata.getAllPartitionPaths();
>     }
>  }
> {code}
> This is the current implementation, where `HoodieTableMetadata.create()` always creates a `HoodieBackedTableMetadata`. Instead, we should create a `FileSystemBackedTableMetadata` when `useFileListingFromMetadata == false`. This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then reworking `FileSystemBackedTableMetadata` so that it can do parallelized listings using the passed-in engine context, either `HoodieSparkEngineContext` or `HoodieJavaEngineContext`. `HoodieBackedTableMetadata#getPartitionsToFilesMapping` already has some parallelized code; we should take one pass over it and see whether it can be reworked a bit as well. Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216
>  
> *Change #3*
> There are places where we call fs.listStatus() directly. We should make them go through the HoodieTable.getMetadata()... route as well. Essentially, all listing should be concentrated in `FileSystemBackedTableMetadata`.
> !image-2021-01-05-10-00-35-187.png!
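
A minimal sketch of the selection proposed in Change #1 may help make it concrete. Every type here (`TableMetadata`, `MetadataTableBacked`, `FileSystemBacked`) is a hypothetical stand-in written for illustration, not a real Hudi class, and the partition lists are placeholder data. The only point shown is that the factory builds the metadata-table-backed implementation only when `useFileListingFromMetadata` is true.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; none of these types are the real Hudi classes.
class MetadataSelectionSketch {

  // Stand-in for the HoodieTableMetadata interface named in the issue.
  interface TableMetadata {
    List<String> getAllPartitionPaths();
  }

  // Stand-in for HoodieBackedTableMetadata (would read from the metadata table).
  static class MetadataTableBacked implements TableMetadata {
    private final String basePath;

    MetadataTableBacked(String basePath) {
      this.basePath = basePath;
    }

    String basePath() {
      return basePath;
    }

    @Override
    public List<String> getAllPartitionPaths() {
      // Placeholder data; a real implementation would query the metadata table.
      return Arrays.asList("2021/01/05", "2021/01/06");
    }
  }

  // Stand-in for FileSystemBackedTableMetadata (would list the file system directly).
  static class FileSystemBacked implements TableMetadata {
    private final String basePath;

    FileSystemBacked(String basePath) {
      this.basePath = basePath;
    }

    String basePath() {
      return basePath;
    }

    @Override
    public List<String> getAllPartitionPaths() {
      // Placeholder data; a real implementation would list directories under basePath.
      return Arrays.asList("2021/01/05", "2021/01/06");
    }
  }

  // The selection proposed in Change #1: pick the implementation from the flag
  // instead of always creating the metadata-table-backed one.
  static TableMetadata create(String basePath, boolean useFileListingFromMetadata) {
    return useFileListingFromMetadata
        ? new MetadataTableBacked(basePath)
        : new FileSystemBacked(basePath);
  }

  public static void main(String[] args) {
    TableMetadata metadata = create("/tmp/hypothetical_table", false);
    System.out.println(metadata.getClass().getSimpleName());
  }
}
```

With this shape, callers that disable metadata-based listing never pay for constructing the metadata-table-backed implementation.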

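The parallelized listing suggested in Change #2 could look roughly like the sketch below. The `EngineContext` interface and its `map` method are assumptions made for illustration and do not reproduce the real `HoodieEngineContext` API; `LocalEngineContext` uses a plain Java parallel stream in place of a Spark or Java engine, and the listing uses `java.nio.file` on a temporary local directory rather than a Hadoop `FileSystem`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative stand-ins only; this is not the real HoodieEngineContext API.
class ParallelListingSketch {

  // Hypothetical, simplified engine abstraction: apply a function to each input.
  interface EngineContext {
    <I, O> List<O> map(List<I> data, Function<I, O> func);
  }

  // Local stand-in for an engine: runs the function on a parallel stream.
  static class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> func) {
      // Collecting an ordered parallel stream keeps results in input order.
      return data.parallelStream().map(func).collect(Collectors.toList());
    }
  }

  // Counts regular files per partition, one engine task per partition directory.
  static Map<String, Integer> countFilesPerPartition(
      EngineContext context, Path basePath, List<String> partitions) {
    List<Integer> counts = context.map(partitions, partition -> {
      try (Stream<Path> files = Files.list(basePath.resolve(partition))) {
        return (int) files.filter(Files::isRegularFile).count();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
    Map<String, Integer> result = new TreeMap<>();
    for (int i = 0; i < partitions.size(); i++) {
      result.put(partitions.get(i), counts.get(i));
    }
    return result;
  }

  // Builds a small temporary table layout and lists it through the engine.
  static Map<String, Integer> demo() {
    try {
      Path base = Files.createTempDirectory("listing_sketch");
      Files.createDirectories(base.resolve("p1"));
      Files.createDirectories(base.resolve("p2"));
      Files.createFile(base.resolve("p1").resolve("a.parquet"));
      Files.createFile(base.resolve("p1").resolve("b.parquet"));
      Files.createFile(base.resolve("p2").resolve("c.parquet"));
      return countFilesPerPartition(
          new LocalEngineContext(), base, Arrays.asList("p1", "p2"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Passing the engine in as a parameter keeps the listing code independent of how the work is distributed, which is the property Change #2 is after.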


--
This message was sent by Atlassian Jira
(v8.3.4#803005)