You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/18 01:31:20 UTC

[GitHub] [hudi] yihua opened a new pull request, #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

yihua opened a new pull request, #7690:
URL: https://github.com/apache/hudi/pull/7690

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404365783

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637",
       "triggerID" : "1404349041",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637) 
   * cb54aec5453d7f5a448674ffdcfac2351ccc89f8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1086114031


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -81,6 +86,8 @@ public abstract class AbstractTableFileSystemView implements SyncableFileSystemV
   private static final Logger LOG = LogManager.getLogger(AbstractTableFileSystemView.class);
 
   protected HoodieTableMetaClient metaClient;
+  // TODO: to pass in the actual config
+  protected boolean assumeDatePartitioning = false;

Review Comment:
   no longer needed.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {

Review Comment:
   After thinking about this again, I think it makes sense to disallow FS-based listing call for all partitions.  Now `getAllPartitionPaths()` throws `HoodieException` if invoked for FS-based listing.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {

Review Comment:
   Makes sense.  I made a change to synchronize the partition loading with `addedPartitions`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1086169007


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +292,113 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    synchronized (addedPartitions) {
+      partitionList.forEach(partition ->

Review Comment:
   we can move this to the end and we can remove synchronized



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403186587

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1396342914

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403021176

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404360677

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637",
       "triggerID" : "1404349041",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1086116287


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {
+          if (!isPartitionAvailableInStore(partitionPathStr)) {
+            partitionSet.add(partitionPathStr);
+          }
+          return true;
+        })
+    );
+
+    if (!partitionSet.isEmpty()) {
+      long beginTs = System.currentTimeMillis();
+      // Not loaded yet
+      try {
+        LOG.info("Building file system view for partitions " + partitionSet);
+
+        // Pairs of relative partition path and absolute partition path
+        List<Pair<String, Path>> absolutePartitionPathList = partitionSet.stream()
+            .map(partition -> Pair.of(
+                partition, FSUtils.getPartitionPath(metaClient.getBasePathV2(), partition)))
+            .collect(Collectors.toList());
+        long beginLsTs = System.currentTimeMillis();
+        Map<Pair<String, Path>, FileStatus[]> statusesMap =
+            listPartitions(absolutePartitionPathList);
+        long endLsTs = System.currentTimeMillis();
+        LOG.debug("Time taken to list partitions " + partitionSet + " =" + (endLsTs - beginLsTs));
+        statusesMap.forEach((partitionPair, statuses) -> {
+          String relativePartitionStr = partitionPair.getLeft();
+          List<HoodieFileGroup> groups = addFilesToView(statuses);
+          if (groups.isEmpty()) {
+            storePartitionView(relativePartitionStr, new ArrayList<>());
+          }
+          LOG.debug("#files found in partition (" + relativePartitionStr + ") =" + statuses.length);
+        });
+      } catch (IOException e) {
+        throw new HoodieIOException("Failed to list base files in partitions " + partitionSet, e);
+      }
+      long endTs = System.currentTimeMillis();
+      LOG.debug("Time to load partition " + partitionSet + " =" + (endTs - beginTs));
+    }
+  }
+
+  /***
+   * @return A list of relative partition paths of all partitions.
+   * @throws IOException upon error.
+   */
+  protected List<String> getAllPartitionPaths() throws IOException {

Review Comment:
   Fixed.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -550,6 +674,28 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesBeforeOrOn(String partitio
     }
   }
 
+  @Override

Review Comment:
   I think eventually we should unify the code path between metadata-table-based and FS-based code path, so I put it here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1086092057


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/savepoint/SavepointActionExecutor.java:
##########
@@ -84,15 +84,32 @@ public HoodieSavepointMetadata execute() {
           "Could not savepoint commit " + instantTime + " as this is beyond the lookup window " + lastCommitRetained);
 
       context.setJobStatus(this.getClass().getSimpleName(), "Collecting latest files for savepoint " + instantTime + " " + table.getConfig().getTableName());
-      List<String> partitions = FSUtils.getAllPartitionPaths(context, config.getMetadataConfig(), table.getMetaClient().getBasePath());
-      Map<String, List<String>> latestFilesMap = context.mapToPair(partitions, partitionPath -> {
-        // Scan all partitions files with this commit time
-        LOG.info("Collecting latest files in partition path " + partitionPath);
-        TableFileSystemView.BaseFileOnlyView view = table.getBaseFileOnlyView();
-        List<String> latestFiles = view.getLatestBaseFilesBeforeOrOn(partitionPath, instantTime)
-            .map(HoodieBaseFile::getFileName).collect(Collectors.toList());
-        return new ImmutablePair<>(partitionPath, latestFiles);
-      }, null);
+      TableFileSystemView.BaseFileOnlyView view = table.getBaseFileOnlyView();
+
+      Map<String, List<String>> latestFilesMap;
+      // NOTE: for performance, we have to use different logic here for listing the latest files
+      // before or on the given instant:
+      // (1) using metadata-table-based file listing: instead of parallelizing the partition
+      // listing which incurs unnecessary metadata table reads, we directly read the metadata
+      // table once in a batch manner through the timeline server;
+      // (2) using direct file system listing:  we parallelize the partition listing so that
+      // each partition can be listed on the file system concurrently through Spark.
+      if (config.getMetadataConfig().enabled()) {

Review Comment:
   This ticket tracks the follow-up: [HUDI-5611](https://issues.apache.org/jira/browse/HUDI-5611).



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java:
##########
@@ -157,8 +156,8 @@ public HoodieWriteConfig.Builder getConfigBuilder(String schemaStr, IndexType in
         .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(indexType).build())
         .withEmbeddedTimelineServerEnabled(true).withFileSystemViewConfig(FileSystemViewStorageConfig.newBuilder()
             .withEnableBackupForRemoteFileSystemView(false) // Fail test if problem connecting to timeline-server
-            .withRemoteServerPort(timelineServicePort)
-            .withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build());
+            .withRemoteServerPort(timelineServicePort).build());
+    //.withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE)

Review Comment:
   This is for testing the default storage type temporarily only.  Will revert this.



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java:
##########
@@ -157,8 +156,8 @@ public HoodieWriteConfig.Builder getConfigBuilder(String schemaStr, IndexType in
         .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(indexType).build())
         .withEmbeddedTimelineServerEnabled(true).withFileSystemViewConfig(FileSystemViewStorageConfig.newBuilder()
             .withEnableBackupForRemoteFileSystemView(false) // Fail test if problem connecting to timeline-server
-            .withRemoteServerPort(timelineServicePort)
-            .withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build());
+            .withRemoteServerPort(timelineServicePort).build());
+    //.withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE)

Review Comment:
   Through this testing, I discovered that `SpillableMapBasedFileSystemView` and `RocksDbBasedFileSystemView` are not integrated with metadata-table-based file listing. Filed a ticket: HUDI-5612.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -596,8 +742,8 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesInRange(List<String> commi
       return fetchAllStoredFileGroups()
           .filter(fileGroup -> !isFileGroupReplacedBeforeAny(fileGroup.getFileGroupId(), commitsToReturn))
           .map(fileGroup -> Pair.of(fileGroup.getFileGroupId(), Option.fromJavaOptional(
-          fileGroup.getAllBaseFiles().filter(baseFile -> commitsToReturn.contains(baseFile.getCommitTime())
-              && !isBaseFileDueToPendingCompaction(baseFile) && !isBaseFileDueToPendingClustering(baseFile)).findFirst()))).filter(p -> p.getValue().isPresent())
+              fileGroup.getAllBaseFiles().filter(baseFile -> commitsToReturn.contains(baseFile.getCommitTime())

Review Comment:
   This is automatically changed due to the IDE formatter.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -550,6 +674,28 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesBeforeOrOn(String partitio
     }
   }
 
+  @Override
+  public final Map<String, Stream<HoodieBaseFile>> getAllLatestBaseFilesBeforeOrOn(String maxCommitTime) {
+    try {
+      readLock.lock();
+
+      List<String> formattedPartitionList = ensureAllPartitionsLoadedCorrectly();
+      return formattedPartitionList.stream().collect(Collectors.toMap(
+          Function.identity(),
+          partitionPath -> fetchAllStoredFileGroups(partitionPath)
+              .filter(fileGroup -> !isFileGroupReplacedBeforeOrOn(fileGroup.getFileGroupId(), maxCommitTime))

Review Comment:
   We can extract the core logic of filtering file groups.  We cannot directly reuse `getLatestBaseFilesBeforeOrOn()` because it triggers metadata table lookup per partition, which this PR is trying to fix.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {
+          if (!isPartitionAvailableInStore(partitionPathStr)) {
+            partitionSet.add(partitionPathStr);
+          }
+          return true;
+        })
+    );
+
+    if (!partitionSet.isEmpty()) {
+      long beginTs = System.currentTimeMillis();
+      // Not loaded yet
+      try {
+        LOG.info("Building file system view for partitions " + partitionSet);
+
+        // Pairs of relative partition path and absolute partition path
+        List<Pair<String, Path>> absolutePartitionPathList = partitionSet.stream()
+            .map(partition -> Pair.of(
+                partition, FSUtils.getPartitionPath(metaClient.getBasePathV2(), partition)))
+            .collect(Collectors.toList());
+        long beginLsTs = System.currentTimeMillis();
+        Map<Pair<String, Path>, FileStatus[]> statusesMap =
+            listPartitions(absolutePartitionPathList);
+        long endLsTs = System.currentTimeMillis();
+        LOG.debug("Time taken to list partitions " + partitionSet + " =" + (endLsTs - beginLsTs));
+        statusesMap.forEach((partitionPair, statuses) -> {
+          String relativePartitionStr = partitionPair.getLeft();
+          List<HoodieFileGroup> groups = addFilesToView(statuses);
+          if (groups.isEmpty()) {
+            storePartitionView(relativePartitionStr, new ArrayList<>());
+          }
+          LOG.debug("#files found in partition (" + relativePartitionStr + ") =" + statuses.length);
+        });
+      } catch (IOException e) {
+        throw new HoodieIOException("Failed to list base files in partitions " + partitionSet, e);
+      }
+      long endTs = System.currentTimeMillis();
+      LOG.debug("Time to load partition " + partitionSet + " =" + (endTs - beginTs));
+    }
+  }
+
+  /***
+   * @return A list of relative partition paths of all partitions.
+   * @throws IOException upon error.
+   */
+  protected List<String> getAllPartitionPaths() throws IOException {
+    // TODO: integrate the direct FS listing with the actual engine context
+    LOG.warn("Getting all partition paths with file system listing sequentially can be very slow. "
+        + "This should not be invoked.");
+    String bathPath = metaClient.getBasePathV2().toString();
+    FileSystem fs = new Path(bathPath).getFileSystem(metaClient.getHadoopConf());
+    if (assumeDatePartitioning) {
+      return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, bathPath);
+    }
+    return FSUtils.getPartitionPathsWithPrefix(

Review Comment:
   This is the same logic used by `FileSystemBackedTableMetadata`.



##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -319,6 +320,68 @@ public static List<String> getAllPartitionPaths(HoodieEngineContext engineContex
     }
   }
 
+  /**
+   * Gets all partition paths in a table, based on a relative partition path prefix.
+   *
+   * @param engineContext {@link HoodieEngineContext} instance for file listing.
+   * @param hadoopConf    Hadoop configuration.
+   * @param basePath      Base path of the table.
+   * @param prefix        Relative partition path prefix for listing.
+   * @param parallelism   Listing parallelism to use if the engine context supports it.
+   * @return A list of  all partition paths with the prefix.
+   * @throws IOException upon error.
+   */
+  public static List<String> getPartitionPathsWithPrefix(HoodieEngineContext engineContext,

Review Comment:
   This method is extracted from `FileSystemBackedTableMetadata` for reuse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404492353

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637",
       "triggerID" : "1404349041",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14638",
       "triggerID" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * cb54aec5453d7f5a448674ffdcfac2351ccc89f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14638) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403085537

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 3b3000d68db2b47b46bb3f39d127b4655cfe53b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618) 
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403025684

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 3b3000d68db2b47b46bb3f39d127b4655cfe53b2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404646119

   CI passes except one flaky test which is irrelevant.  Merging this PR.
   <img width="1613" alt="Screen Shot 2023-01-25 at 23 36 42" src="https://user-images.githubusercontent.com/2497195/214781186-a7ef75ca-9ecd-4fd5-b43a-005fafbdaf2b.png">
   <img width="1243" alt="Screen Shot 2023-01-25 at 23 00 37" src="https://user-images.githubusercontent.com/2497195/214781194-6315c469-4ea9-4a09-a424-d44afbe6fee1.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404349041

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1086184617


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +292,113 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    synchronized (addedPartitions) {
+      partitionList.forEach(partition ->

Review Comment:
   Addressed.  We still need the `synchronized` block to prevent the listing of the same set of partitions concurrently.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1386503584

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ca9fb1e21c08ac0eb7dc6305934f1c58803070e3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403029815

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 3b3000d68db2b47b46bb3f39d127b4655cfe53b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua merged pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua merged PR #7690:
URL: https://github.com/apache/hudi/pull/7690


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1404369548

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637",
       "triggerID" : "1404349041",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14638",
       "triggerID" : "cb54aec5453d7f5a448674ffdcfac2351ccc89f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14619) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14637) 
   * cb54aec5453d7f5a448674ffdcfac2351ccc89f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14638) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1403079877

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a708a002f3bb5dc237e430cff5ce4f43e8a97b51",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618",
       "triggerID" : "3b3000d68db2b47b46bb3f39d127b4655cfe53b2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "98ff092b7f0142a017132604fd8b3c81fb0fb4d5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   * a708a002f3bb5dc237e430cff5ce4f43e8a97b51 UNKNOWN
   * 3b3000d68db2b47b46bb3f39d127b4655cfe53b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14618) 
   * 98ff092b7f0142a017132604fd8b3c81fb0fb4d5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1386365026

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ca9fb1e21c08ac0eb7dc6305934f1c58803070e3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7690:
URL: https://github.com/apache/hudi/pull/7690#discussion_r1084652972


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/savepoint/SavepointActionExecutor.java:
##########
@@ -84,15 +84,32 @@ public HoodieSavepointMetadata execute() {
           "Could not savepoint commit " + instantTime + " as this is beyond the lookup window " + lastCommitRetained);
 
       context.setJobStatus(this.getClass().getSimpleName(), "Collecting latest files for savepoint " + instantTime + " " + table.getConfig().getTableName());
-      List<String> partitions = FSUtils.getAllPartitionPaths(context, config.getMetadataConfig(), table.getMetaClient().getBasePath());
-      Map<String, List<String>> latestFilesMap = context.mapToPair(partitions, partitionPath -> {
-        // Scan all partitions files with this commit time
-        LOG.info("Collecting latest files in partition path " + partitionPath);
-        TableFileSystemView.BaseFileOnlyView view = table.getBaseFileOnlyView();
-        List<String> latestFiles = view.getLatestBaseFilesBeforeOrOn(partitionPath, instantTime)
-            .map(HoodieBaseFile::getFileName).collect(Collectors.toList());
-        return new ImmutablePair<>(partitionPath, latestFiles);
-      }, null);
+      TableFileSystemView.BaseFileOnlyView view = table.getBaseFileOnlyView();
+
+      Map<String, List<String>> latestFilesMap;
+      // NOTE: for performance, we have to use different logic here for listing the latest files
+      // before or on the given instant:
+      // (1) using metadata-table-based file listing: instead of parallelizing the partition
+      // listing which incurs unnecessary metadata table reads, we directly read the metadata
+      // table once in a batch manner through the timeline server;
+      // (2) using direct file system listing:  we parallelize the partition listing so that
+      // each partition can be listed on the file system concurrently through Spark.
+      if (config.getMetadataConfig().enabled()) {

Review Comment:
   likely there are other code paths in hudi where we can employ similar optimizations. can we file a follow up ticket to triage them and tackle



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java:
##########
@@ -157,8 +156,8 @@ public HoodieWriteConfig.Builder getConfigBuilder(String schemaStr, IndexType in
         .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(indexType).build())
         .withEmbeddedTimelineServerEnabled(true).withFileSystemViewConfig(FileSystemViewStorageConfig.newBuilder()
             .withEnableBackupForRemoteFileSystemView(false) // Fail test if problem connecting to timeline-server
-            .withRemoteServerPort(timelineServicePort)
-            .withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build());
+            .withRemoteServerPort(timelineServicePort).build());
+    //.withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE)

Review Comment:
   why commented out? 



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -81,6 +86,8 @@ public abstract class AbstractTableFileSystemView implements SyncableFileSystemV
   private static final Logger LOG = LogManager.getLogger(AbstractTableFileSystemView.class);
 
   protected HoodieTableMetaClient metaClient;
+  // TODO: to pass in the actual config
+  protected boolean assumeDatePartitioning = false;

Review Comment:
   tracking ticket please. 



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -550,6 +674,28 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesBeforeOrOn(String partitio
     }
   }
 
+  @Override

Review Comment:
   should we override this from within MetadataFileSystemView and throw from within AbstractTableFileSystemView ?



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {

Review Comment:
   left a msg below. we should keep this within MetadataFSV. should not let FS based listing call in these by mistake. 



##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -319,6 +320,68 @@ public static List<String> getAllPartitionPaths(HoodieEngineContext engineContex
     }
   }
 
+  /**
+   * Gets all partition paths in a table, based on a relative partition path prefix.
+   *
+   * @param engineContext {@link HoodieEngineContext} instance for file listing.
+   * @param hadoopConf    Hadoop configuration.
+   * @param basePath      Base path of the table.
+   * @param prefix        Relative partition path prefix for listing.
+   * @param parallelism   Listing parallelism to use if the engine context supports it.
+   * @return A list of  all partition paths with the prefix.
+   * @throws IOException upon error.
+   */
+  public static List<String> getPartitionPathsWithPrefix(HoodieEngineContext engineContext,

Review Comment:
   we have FSUtils.getAllPartitionPaths(engineContext, basePath, false, cfg.assumeDatePartitioning) already. Don't think calling prefix based api is the right thing to do here. 



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {
+          if (!isPartitionAvailableInStore(partitionPathStr)) {
+            partitionSet.add(partitionPathStr);
+          }
+          return true;
+        })
+    );
+
+    if (!partitionSet.isEmpty()) {
+      long beginTs = System.currentTimeMillis();
+      // Not loaded yet
+      try {
+        LOG.info("Building file system view for partitions " + partitionSet);
+
+        // Pairs of relative partition path and absolute partition path
+        List<Pair<String, Path>> absolutePartitionPathList = partitionSet.stream()
+            .map(partition -> Pair.of(
+                partition, FSUtils.getPartitionPath(metaClient.getBasePathV2(), partition)))
+            .collect(Collectors.toList());
+        long beginLsTs = System.currentTimeMillis();
+        Map<Pair<String, Path>, FileStatus[]> statusesMap =
+            listPartitions(absolutePartitionPathList);
+        long endLsTs = System.currentTimeMillis();
+        LOG.debug("Time taken to list partitions " + partitionSet + " =" + (endLsTs - beginLsTs));
+        statusesMap.forEach((partitionPair, statuses) -> {
+          String relativePartitionStr = partitionPair.getLeft();
+          List<HoodieFileGroup> groups = addFilesToView(statuses);
+          if (groups.isEmpty()) {
+            storePartitionView(relativePartitionStr, new ArrayList<>());
+          }
+          LOG.debug("#files found in partition (" + relativePartitionStr + ") =" + statuses.length);
+        });
+      } catch (IOException e) {
+        throw new HoodieIOException("Failed to list base files in partitions " + partitionSet, e);
+      }
+      long endTs = System.currentTimeMillis();
+      LOG.debug("Time to load partition " + partitionSet + " =" + (endTs - beginTs));
+    }
+  }
+
+  /***
+   * @return A list of relative partition paths of all partitions.
+   * @throws IOException upon error.
+   */
+  protected List<String> getAllPartitionPaths() throws IOException {

Review Comment:
   let's not support these apis for FS based listing. 



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -550,6 +674,28 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesBeforeOrOn(String partitio
     }
   }
 
+  @Override
+  public final Map<String, Stream<HoodieBaseFile>> getAllLatestBaseFilesBeforeOrOn(String maxCommitTime) {
+    try {
+      readLock.lock();
+
+      List<String> formattedPartitionList = ensureAllPartitionsLoadedCorrectly();
+      return formattedPartitionList.stream().collect(Collectors.toMap(
+          Function.identity(),
+          partitionPath -> fetchAllStoredFileGroups(partitionPath)
+              .filter(fileGroup -> !isFileGroupReplacedBeforeOrOn(fileGroup.getFileGroupId(), maxCommitTime))

Review Comment:
   is it possible to reuse `getLatestBaseFilesBeforeOrOn(String partitionStr, String maxCommitTime)`  to avoid code duplication. just incase we make some bug fix in the other method, might forget to fix here.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -596,8 +742,8 @@ public final Stream<HoodieBaseFile> getLatestBaseFilesInRange(List<String> commi
       return fetchAllStoredFileGroups()
           .filter(fileGroup -> !isFileGroupReplacedBeforeAny(fileGroup.getFileGroupId(), commitsToReturn))
           .map(fileGroup -> Pair.of(fileGroup.getFileGroupId(), Option.fromJavaOptional(
-          fileGroup.getAllBaseFiles().filter(baseFile -> commitsToReturn.contains(baseFile.getCommitTime())
-              && !isBaseFileDueToPendingCompaction(baseFile) && !isBaseFileDueToPendingClustering(baseFile)).findFirst()))).filter(p -> p.getValue().isPresent())
+              fileGroup.getAllBaseFiles().filter(baseFile -> commitsToReturn.contains(baseFile.getCommitTime())

Review Comment:
   can you please revert unintentional changes. will be easy to review



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {

Review Comment:
   we should add entries to addedPartitions only when the file groups pertaining to a particular partition is already loaded. there could be concurrent access to this map (addedPartitions). But in this impl. I see we first add them to the map add later listing files and populate the file groups. We might need to fix that.
   



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -289,6 +296,123 @@ private void clear() {
    */
   protected abstract void resetViewState();
 
+  /**
+   * Batch loading all the partitions if needed.
+   *
+   * @return A list of relative partition paths of all partitions.
+   */
+  private List<String> ensureAllPartitionsLoadedCorrectly() {
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+    try {
+      List<String> formattedPartitionList = getAllPartitionPaths().stream()
+          .map(this::formatPartitionKey).collect(Collectors.toList());
+      ensurePartitionsLoadedCorrectly(formattedPartitionList);
+      return formattedPartitionList;
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to get all partition paths", e);
+    }
+  }
+
+  /**
+   * Allows lazily loading the partitions if needed.
+   *
+   * @param partitionList list of partitions to be loaded if not present.
+   */
+  private void ensurePartitionsLoadedCorrectly(List<String> partitionList) {
+
+    ValidationUtils.checkArgument(!isClosed(), "View is already closed");
+
+    Set<String> partitionSet = new HashSet<>();
+    partitionList.forEach(partition ->
+        addedPartitions.computeIfAbsent(partition, partitionPathStr -> {
+          if (!isPartitionAvailableInStore(partitionPathStr)) {
+            partitionSet.add(partitionPathStr);
+          }
+          return true;
+        })
+    );
+
+    if (!partitionSet.isEmpty()) {
+      long beginTs = System.currentTimeMillis();
+      // Not loaded yet
+      try {
+        LOG.info("Building file system view for partitions " + partitionSet);
+
+        // Pairs of relative partition path and absolute partition path
+        List<Pair<String, Path>> absolutePartitionPathList = partitionSet.stream()
+            .map(partition -> Pair.of(
+                partition, FSUtils.getPartitionPath(metaClient.getBasePathV2(), partition)))
+            .collect(Collectors.toList());
+        long beginLsTs = System.currentTimeMillis();
+        Map<Pair<String, Path>, FileStatus[]> statusesMap =
+            listPartitions(absolutePartitionPathList);
+        long endLsTs = System.currentTimeMillis();
+        LOG.debug("Time taken to list partitions " + partitionSet + " =" + (endLsTs - beginLsTs));
+        statusesMap.forEach((partitionPair, statuses) -> {
+          String relativePartitionStr = partitionPair.getLeft();
+          List<HoodieFileGroup> groups = addFilesToView(statuses);
+          if (groups.isEmpty()) {
+            storePartitionView(relativePartitionStr, new ArrayList<>());
+          }
+          LOG.debug("#files found in partition (" + relativePartitionStr + ") =" + statuses.length);
+        });
+      } catch (IOException e) {
+        throw new HoodieIOException("Failed to list base files in partitions " + partitionSet, e);
+      }
+      long endTs = System.currentTimeMillis();
+      LOG.debug("Time to load partition " + partitionSet + " =" + (endTs - beginTs));
+    }
+  }
+
+  /***
+   * @return A list of relative partition paths of all partitions.
+   * @throws IOException upon error.
+   */
+  protected List<String> getAllPartitionPaths() throws IOException {
+    // TODO: integrate the direct FS listing with the actual engine context
+    LOG.warn("Getting all partition paths with file system listing sequentially can be very slow. "
+        + "This should not be invoked.");
+    String bathPath = metaClient.getBasePathV2().toString();
+    FileSystem fs = new Path(bathPath).getFileSystem(metaClient.getHadoopConf());
+    if (assumeDatePartitioning) {
+      return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, bathPath);
+    }
+    return FSUtils.getPartitionPathsWithPrefix(

Review Comment:
   calling prefix based apis to fetch all partition paths does not look right. lets see how we can fix this properly. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1386373072

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ca9fb1e21c08ac0eb7dc6305934f1c58803070e3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1387707894

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ca9fb1e21c08ac0eb7dc6305934f1c58803070e3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373) 
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14411) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7690: [HUDI-5485] Add File System View API for batch listing and improve savepoint performance with metadata table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7690:
URL: https://github.com/apache/hudi/pull/7690#issuecomment-1387691101

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373",
       "triggerID" : "ca9fb1e21c08ac0eb7dc6305934f1c58803070e3",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "8c3cc74c973ae760ebd1803dfdc638e9821f181a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ca9fb1e21c08ac0eb7dc6305934f1c58803070e3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14373) 
   * 8c3cc74c973ae760ebd1803dfdc638e9821f181a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org