You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/10 04:37:55 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/o metadata table

nsivabalan opened a new pull request, #5829:
URL: https://github.com/apache/hudi/pull/5829

   ## What is the purpose of the pull request
   
   - Looks like getAllPartitionPaths have regressed in perf from 0.9.0 to 0.11.0. I could able to triage it to newly written code where in "isExists()" is called to determine Hoodie partition metadata. So, reverting the change to how it was before. 
   
   I tried listing a table with 10k partitions. 0.9.0 is taking 85 to 90 secs, while 0.11.0 took ~800 secs. 
   HoodiePartitionMetadata.hasPartitionMetadata is taking roughly 50 to 60% when profiled. With this patch, listing took ~90 secs. 
   
   ![Screen Shot 2022-06-10 at 12 35 28 AM](https://user-images.githubusercontent.com/513218/172991214-5c70a81e-9099-4555-aabb-4cf16926f1ca.png)
   
   
   ## Brief change log
   
   - Fixing getAllPartitionPaths for FileSystemBackedMetadata.
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152675808

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152817448

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152821655

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1152817448",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9224",
       "triggerID" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223) 
   * 1c8a4d9872985c18057bdad684f8a564ee5f8998 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9224) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1151997687

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152820770

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1152817448",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223) 
   * 1c8a4d9872985c18057bdad684f8a564ee5f8998 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r894833120


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +83,35 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      engineContext.foreach(dirToFileListing, new SerializableConsumer<Pair<Path, FileStatus[]>>() {
+        @Override
+        public void accept(Pair<Path, FileStatus[]> p) throws Exception {
+          Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+              .filter(fileStatus -> fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+              .findFirst());
+
+          if (partitionMetaFile.isPresent()) {
+            // Is a partition.
+            String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());

Review Comment:
   Please create a separate var for `new Path(datasetBasePath)`. `Path` ctor has non-trivial amount of overhead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua merged pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
yihua merged PR #5829:
URL: https://github.com/apache/hudi/pull/5829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r932112592


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +82,31 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {
+        Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+            .filter(fileStatus -> fileStatus.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+            .findFirst());
+
+        if (partitionMetaFile.isPresent()) {
+          // Is a partition.
+          String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());
+          partitionPaths.add(partitionName);
+        } else {
+          // Add sub-dirs to the queue
+          pathsToList.addAll(Arrays.stream(p.getRight())
+              .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))

Review Comment:
   Reverted this in the latest patch put up https://github.com/apache/hudi/pull/6234 
   



##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +82,31 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {
+        Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+            .filter(fileStatus -> fileStatus.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+            .findFirst());
+
+        if (partitionMetaFile.isPresent()) {
+          // Is a partition.
+          String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());
+          partitionPaths.add(partitionName);
+        } else {
+          // Add sub-dirs to the queue
+          pathsToList.addAll(Arrays.stream(p.getRight())
+              .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))

Review Comment:
   Reverted this in the latest patch put up https://github.com/apache/hudi/pull/6234 . no follow ups required for now.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r894916125


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +82,31 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {
+        Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+            .filter(fileStatus -> fileStatus.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+            .findFirst());
+
+        if (partitionMetaFile.isPresent()) {
+          // Is a partition.
+          String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());
+          partitionPaths.add(partitionName);
+        } else {
+          // Add sub-dirs to the queue
+          pathsToList.addAll(Arrays.stream(p.getRight())
+              .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))

Review Comment:
   I tried to incorporate this. I wanted to see if we can do this only for first time where we list the base path. but we have to do it within the while loop and so we might do it unnecessarily for every loop anyways. will go ahead w/ the patch for now for 0.11.1. but lets discuss and I can put up a follow up PR. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r895063517


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +83,35 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }

Review Comment:
   Synced offline.  This is still needed for non-partitioned table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152312290

   @zhangyue19921010 : I would like to discuss this patch w/ you. Looks like for a table w/ very large no of partitions, we are regressing after https://github.com/apache/hudi/pull/4643. Can you take a look to see if we can come up w/ some middle ground may be. 10x is definitely bad. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152313847

   From what I can infer, guess it might be hard to stick to one solution. 
   probably it may vary depending on whether we have very high no of partitions with less no of files per partition or whether we have too many files within each partition with less no of partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1153996052

   https://issues.apache.org/jira/browse/HUDI-4242


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1151956525

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152789694

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152840786

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1152817448",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9224",
       "triggerID" : "1c8a4d9872985c18057bdad684f8a564ee5f8998",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c8a4d9872985c18057bdad684f8a564ee5f8998 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9224) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r894799894


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +83,35 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()

Review Comment:
   @nsivabalan making this parallelized with `engineContext` should solve the problem and avoid listing each partition, right, instead of reverting the changes?



##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +83,35 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }

Review Comment:
   This is redundant as `HoodiePartitionMetadata::hasPartitionMetadata` should already cover that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1151954252

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152819982

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1152817448",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] vinothchandar commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r929286057


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -68,41 +68,46 @@ public FileStatus[] getAllFilesInPartition(Path partitionPath) throws IOExceptio
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
+    Path basePath = new Path(datasetBasePath);
+    FileSystem fs = basePath.getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
     }
 
     List<Path> pathsToList = new CopyOnWriteArrayList<>();
-    pathsToList.add(new Path(datasetBasePath));
+    pathsToList.add(basePath);
     List<String> partitionPaths = new CopyOnWriteArrayList<>();
 
     while (!pathsToList.isEmpty()) {
       // TODO: Get the parallelism from HoodieWriteConfig
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {

Review Comment:
   don't we have an utilty for this in FSUtils.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +82,31 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {
+        Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+            .filter(fileStatus -> fileStatus.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+            .findFirst());
+
+        if (partitionMetaFile.isPresent()) {
+          // Is a partition.
+          String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());
+          partitionPaths.add(partitionName);
+        } else {
+          // Add sub-dirs to the queue
+          pathsToList.addAll(Arrays.stream(p.getRight())
+              .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))

Review Comment:
   is there a follow up JIRA?



##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +83,35 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()

Review Comment:
   +1 . I think if we simply do the processing we need to do within the single engineContext.map(..) call, we should be able to solve the original problem in #4643. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152765348

   I went back and forth based on feedback given. 
   
   Here are my findings. 
   For one of my test tables in S3, with EMR cluster: 
   1. With 0.11.0:
   147 secs. 
   2. With this patch as is (where engine context is not used for 2nd phase)
   5.7 secs. 
   3. Latest master + adding engineContext for 2nd phase:
   16 secs. 
   4. I also tried completely rewriting the dag. 
   5. 12 secs. 
   ```
   
         while (!pathsToList.isEmpty()) {
           // TODO: Get the parallelism from HoodieWriteConfig
           int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
   
           // List all directories in parallel
           List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
             FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
             return Arrays.stream(fileSystem.listStatus(path));
           }, listingParallelism);
           pathsToList.clear();
   
           // if current dictionary contains PartitionMetadata, add it to result
           // if current dictionary does not contain PartitionMetadata, add it to queue
           int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size());
           List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
             FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get());
             if (fileStatus.isDirectory()) {
               if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) {
                 return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty());
               } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
                 return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
               }
             } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
               String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
               return Pair.of(Option.of(partitionName), Option.empty());
             }
             return Pair.of(Option.empty(), Option.empty());
           }, fileListingParallelism);
   
           partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get())
               .collect(Collectors.toList()));
   
           pathsToList.addAll(result.stream().filter(entry -> entry.getValue().isPresent()).map(entry -> entry.getValue().get())
               .collect(Collectors.toList()));
   }
   ```
   
   So, based on above findings, I will go w/ what we have in this patch in its current state. Will address Raymond's and Alexey's feedback alone and unblock 0.11.1.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xushiyan commented on a diff in pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5829:
URL: https://github.com/apache/hudi/pull/5829#discussion_r894481081


##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##########
@@ -82,27 +82,31 @@ public List<String> getAllPartitionPaths() throws IOException {
       int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
       // List all directories in parallel
-      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
         FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-        return fileSystem.listStatus(path);
+        return Pair.of(path, fileSystem.listStatus(path));
       }, listingParallelism);
       pathsToList.clear();
 
       // if current dictionary contains PartitionMetadata, add it to result
       // if current dictionary does not contain PartitionMetadata, add it to queue
-      dirToFileListing.stream().flatMap(Arrays::stream).parallel()
-          .forEach(fileStatus -> {
-            if (fileStatus.isDirectory()) {
-              if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) {
-                partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath()));
-              } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
-                pathsToList.add(fileStatus.getPath());
-              }
-            } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
-              String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
-              partitionPaths.add(partitionName);
-            }
-          });
+      dirToFileListing.forEach(p -> {
+        Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel()
+            .filter(fileStatus -> fileStatus.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
+            .findFirst());
+
+        if (partitionMetaFile.isPresent()) {
+          // Is a partition.
+          String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), p.getLeft());
+          partitionPaths.add(partitionName);
+        } else {
+          // Add sub-dirs to the queue
+          pathsToList.addAll(Arrays.stream(p.getRight())
+              .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))

Review Comment:
   looks like you don't need to check `METAFOLDER_NAME` at all. can we filter it out at the beginning for `pathsToList`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152637408

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209) 
   * fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] zhangyue19921010 commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152332531

   > @zhangyue19921010 : I would like to discuss this patch w/ you. Looks like for a table w/ very large no of partitions, we are regressing after #4643. Can you take a look to see if we can come up w/ some middle ground may be. 10x is definitely bad.
   
   Hi @nsivabalan Sorry for this blocking :< The performance depends on the bottleneck which is the number of partitions or the number of data files in each partitions. It's hard to balance. But as you said 10x is too bad. We definitely need to  revert this change before 0.11.1.
   
   After this maybe we can offer a config to let advanced users decide what kind of method he need to use based on their use case? or some way to detect  automatically
   
   Sorry for this blocking again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152573401

   @vingov : Can you review this to see we are good wrt BigQuery flow (partition meta file )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152640103

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209) 
   * fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152420997

   Here is what I am thinking. As we deduced, either solution is not going to work for both cases. one of them gives 2 to 3x better perf while degrades to 10x for other use-case. While the other has 2 to 3x degraded perf but does not degrade by 10x for other extreme. So, lets go ahead w/ this patch for now. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152787293

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222) 
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9223) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5829: [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5829:
URL: https://github.com/apache/hudi/pull/5829#issuecomment-1152786004

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9209",
       "triggerID" : "b7558b9ff9e4dd756ebd0adf8a0f27e3d4e3dd40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222",
       "triggerID" : "fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1c797ff2c36086f2e5efebd03df3cecb324c555c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fa2ec47dbb84952d8086db54a42cebc6a4a1a7b3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9222) 
   * 1c797ff2c36086f2e5efebd03df3cecb324c555c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org