Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/27 09:55:44 UTC

[GitHub] [hudi] prashantwason opened a new pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

prashantwason opened a new pull request #3873:
URL: https://github.com/apache/hudi/pull/3873


   ## What is the purpose of the pull request
   
   Metadata table bootstrap for very large tables (1,200+ partitions, 10 million+ files) does not complete in a reasonable amount of time, or causes an OOM on the Spark driver node even with 16 GB of memory.
   This patch fixes these scaling issues when bootstrapping very large datasets.
    
   ## Brief change log
   
   The following improvements are implemented:
   1. Memory overhead reduction:
     - The existing code caches a full FileStatus for each file in memory.
     - A new class, DirectoryInfo, caches a directory's file list with only the parts of the FileStatus that are needed (filename and file length). This reduces the memory requirements.
   
   2. Improved parallelism:
     - The existing code collects all listings to the driver and then creates every HoodieRecord on the driver.
     - This takes a long time for large tables (11 million HoodieRecords to be created).
     - A new function in SparkRDDWriteClient, specific to the bootstrap commit, parallelizes the HoodieRecord creation across executors so that it completes quickly.
   
   3. Fixed the setting that limits the number of parallel listings:
     - The existing code had a bug wherein a parallelism of 1500 was hardcoded for listing. This leads to an exception because of the limit on the Spark driver's result memory.
     - Corrected the code to use the config value instead.
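   The memory reduction in item 1 can be illustrated with a minimal, self-contained sketch. This is a hypothetical stand-in for the patch's DirectoryInfo, which wraps Hadoop FileStatus objects; the class and method names here only mirror the patch:

   ```java
   import java.io.Serializable;
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Illustrative sketch: cache only each file's name and length instead of a
   // full FileStatus, so millions of entries fit comfortably in driver memory.
   class CompactDirectoryCache implements Serializable {
     private final String relativePath;           // path relative to the table base path
     private final List<String> filenames = new ArrayList<>();
     private final List<Long> filelengths = new ArrayList<>();

     CompactDirectoryCache(String relativePath) {
       this.relativePath = relativePath;
     }

     // Record the only two per-file properties the metadata table bootstrap needs.
     void addFile(String name, long length) {
       filenames.add(name);
       filelengths.add(length);
     }

     // Returns a map of filenames to their lengths (mirrors getFileMap() in the patch).
     Map<String, Long> getFileMap() {
       Map<String, Long> map = new HashMap<>(filenames.size());
       for (int i = 0; i < filenames.size(); i++) {
         map.put(filenames.get(i), filelengths.get(i));
       }
       return map;
     }

     String getRelativePath() {
       return relativePath;
     }

     int getTotalFiles() {
       return filenames.size();
     }

     public static void main(String[] args) {
       CompactDirectoryCache dir = new CompactDirectoryCache("2021/10/27");
       dir.addFile("base.parquet", 1024L);
       dir.addFile("delta.log", 128L);
       System.out.println(dir.getRelativePath() + " -> " + dir.getFileMap());
     }
   }
   ```

   Holding two small per-file fields instead of a whole FileStatus (paths, permissions, owner strings, block info) is what keeps the driver footprint bounded at the 10-million-file scale.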
   
   Result on a dataset with 1,299 partitions and 12 million files:
   - File listing time: 1.5 minutes
   - HoodieRecord creation time: 13 seconds
   - Deltacommit duration: 2.6 minutes
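   The parallelism change in item 2 replaces a driver-side loop with an executor-side map over the listings. Spark specifics aside, the shape of the change can be sketched in plain Java, using parallelStream() as a single-JVM stand-in for the RDD map; the comma-separated "listing" format and the pair-based record below are hypothetical simplifications, not Hudi types:

   ```java
   import java.util.AbstractMap.SimpleEntry;
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   // Sketch of executor-side record creation: the listings are mapped to records
   // in parallel instead of being built one by one in a driver-side loop.
   class BootstrapRecordSketch {

     // Each listing line is "relativePath/filename,length"; the "record" here is
     // just a (filename, length) pair standing in for a HoodieRecord.
     static List<Map.Entry<String, Long>> createRecords(List<String> listings) {
       // In the patch this is an RDD map running on Spark executors;
       // parallelStream() plays that role in this self-contained sketch.
       return listings.parallelStream()
           .map(line -> {
             String[] parts = line.split(",");
             Map.Entry<String, Long> record = new SimpleEntry<>(parts[0], Long.parseLong(parts[1]));
             return record;
           })
           .collect(Collectors.toList());
     }

     public static void main(String[] args) {
       List<String> listings = Arrays.asList("p1/base.parquet,1024", "p2/delta.log,128");
       System.out.println(createRecords(listings));
     }
   }
   ```

   Because the source list is ordered, collecting the parallel stream still yields records in listing order; the per-element work is simply spread across threads, just as the RDD map spreads it across executors.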
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests for the Hoodie metadata table.
   
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-961588679


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-962413490


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963956382


   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963954031


   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   * fb6efa951068fe303c40e30afbb1eda19175f676 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-964335081


   ## CI report:
   
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   * 9acc548e3336720726d63478f6acb0df0b6bfca8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3251) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-964376815


   ## CI report:
   
   * 9acc548e3336720726d63478f6acb0df0b6bfca8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3251) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744140574



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##########
@@ -36,4 +49,9 @@
   public String getFileExtension() {
     return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
       Yeah, that's what I thought too.
   isLogFile and getFileExtensionFromLog are in FSUtils.
   








[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r743232434



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -645,4 +612,83 @@ protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime
     // metadata table.
     writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * Commit the {@code HoodieRecord}s to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<HoodieRecord> records, String partitionName, String instantTime);
+
+  /**
+   * Commit the partition-to-file-listing information to the Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<DirectoryInfo> dirInfoList, String createInstantTime);
+
+
+  /**
+   * A class which represents a directory and the files and directories inside it.
+   *
+   * A {@code DirectoryInfo} object saves the name of the partition and the per-file properties required for
+   * bootstrapping the metadata table. Saving only these limited properties reduces the total memory footprint
+   * when a very large number of files are present in the dataset being bootstrapped.
+   */
+  public static class DirectoryInfo implements Serializable {
+    // Relative path of the directory (relative to the base directory)
+    private String relativePath;
+    // List of filenames within this partition
+    private List<String> filenames;
+    // Length of the various files
+    private List<Long> filelengths;
+    // List of directories within this partition
+    private List<Path> subdirs = new ArrayList<>();
+    // Is this a HUDI partition
+    private boolean isPartition = false;
+
+    public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
+      this.relativePath = relativePath;
+
+      // Pre-allocate with the maximum length possible
+      filenames = new ArrayList<>(fileStatus.length);
+      filelengths = new ArrayList<>(fileStatus.length);
+
+      for (FileStatus status : fileStatus) {
+        if (status.isDirectory()) {
+          this.subdirs.add(status.getPath());
+        } else if (status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE)) {
+          // Presence of partition meta file implies this is a HUDI partition
+          this.isPartition = true;
+        } else if (FSUtils.isDataFile(status.getPath())) {
+          // Regular HUDI data file (base file or log file)
+          filenames.add(status.getPath().getName());
+          filelengths.add(status.getLen());
+        }
+      }
+    }
+
+    public String getRelativePath() {
+      return relativePath;
+    }
+
+    public int getTotalFiles() {
+      return filenames.size();
+    }
+
+    public boolean isPartition() {
+      return isPartition;
+    }
+
+    public List<Path> getSubdirs() {
+      return subdirs;
+    }
+
+    // Returns a map of filenames mapped to their lengths
+    public Map<String, Long> getFileMap() {

Review comment:
       Done. A good simplification indeed.







[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-962413490


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r745782245



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -147,15 +154,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime, boolean canTriggerTableService) {
+    ValidationUtils.checkState(!partitionInfoList.isEmpty());

Review comment:
       In case of a fresh table, during bootstrap, won't the list be empty?







[GitHub] [hudi] hudi-bot edited a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-952771971


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-964332641


   ## CI report:
   
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   * 9acc548e3336720726d63478f6acb0df0b6bfca8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-962419877


   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan merged pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #3873:
URL: https://github.com/apache/hudi/pull/3873


   





[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r740454888



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -645,4 +612,83 @@ protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime
     // metadata table.
     writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * Commit the {@code HoodieRecord}s to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<HoodieRecord> records, String partitionName, String instantTime);
+
+  /**
+   * Commit the partition-to-file-listing information to the Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<DirectoryInfo> dirInfoList, String createInstantTime);
+
+
+  /**
+   * A class which represents a directory and the files and directories inside it.
+   *
+   * A {@code DirectoryInfo} object saves the name of the partition and the subset of each file's properties
+   * required for bootstrapping the metadata table. Saving limited properties reduces the total memory footprint when
+   * a very large number of files are present in the dataset being bootstrapped.
+   */
+  public static class DirectoryInfo implements Serializable {
+    // Relative path of the directory (relative to the base directory)
+    private String relativePath;
+    // List of filenames within this partition
+    private List<String> filenames;
+    // Length of the various files
+    private List<Long> filelengths;
+    // List of directories within this partition
+    private List<Path> subdirs = new ArrayList<>();
+    // Is this a HUDI partition
+    private boolean isPartition = false;
+
+    public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
+      this.relativePath = relativePath;
+
+      // Pre-allocate with the maximum length possible
+      filenames = new ArrayList<>(fileStatus.length);
+      filelengths = new ArrayList<>(fileStatus.length);
+
+      for (FileStatus status : fileStatus) {
+        if (status.isDirectory()) {
+          this.subdirs.add(status.getPath());
+        } else if (status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE)) {
+          // Presence of partition meta file implies this is a HUDI partition
+          this.isPartition = true;
+        } else if (FSUtils.isDataFile(status.getPath())) {
+          // Regular HUDI data file (base file or log file)
+          filenames.add(status.getPath().getName());
+          filelengths.add(status.getLen());
+        }
+      }
+    }
+
+    public String getRelativePath() {
+      return relativePath;
+    }
+
+    public int getTotalFiles() {
+      return filenames.size();
+    }
+
+    public boolean isPartition() {
+      return isPartition;
+    }
+
+    public List<Path> getSubdirs() {
+      return subdirs;
+    }
+
+    // Returns a map of filenames mapped to their lengths
+    public Map<String, Long> getFileMap() {

Review comment:
       If the caller is always going to be interested in a map of file name to length, can we populate the map directly in the constructor of DirectoryInfo() instead of keeping two separate lists?
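
   The suggestion above could take this shape — a minimal, self-contained sketch in which `FileEntry` is a hypothetical stand-in for Hadoop's `FileStatus` (the real class would keep using `FileStatus[]` in the constructor):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical stand-in for Hadoop's FileStatus, so the sketch is self-contained.
   class FileEntry {
     final String name;
     final long length;
     final boolean isDirectory;

     FileEntry(String name, long length, boolean isDirectory) {
       this.name = name;
       this.length = length;
       this.isDirectory = isDirectory;
     }
   }

   class DirectoryInfoSketch {
     private final String relativePath;
     // A single map replaces the parallel filenames/filelengths lists
     private final Map<String, Long> fileNameToSizeMap = new HashMap<>();

     DirectoryInfoSketch(String relativePath, FileEntry[] entries) {
       this.relativePath = relativePath;
       for (FileEntry e : entries) {
         if (!e.isDirectory) {
           fileNameToSizeMap.put(e.name, e.length);
         }
       }
     }

     Map<String, Long> getFileMap() {
       return fileNameToSizeMap; // no per-call zipping of two lists
     }

     int getTotalFiles() {
       return fileNameToSizeMap.size();
     }

     String getRelativePath() {
       return relativePath;
     }
   }
   ```

   With this shape, `getFileMap()` is O(1) and the per-directory memory footprint stays one map entry per data file.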

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       Also, I see this in master before this patch 
   ```
   if (p.getRight().length > filesInDir.size()) {
             String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
             // deal with Non-partition table, we should exclude .hoodie
             partitionToFileStatus.put(partitionName, filesInDir.stream()
                 .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
           }
   ```
   May I know how we are handling the same in this patch? 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {

Review comment:
       minor: can we name this `processedDirectories`

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       May I know where we are ignoring the meta paths? e.g. L460 before this patch. 
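
   For reference, the filtering being asked about could be sketched like this — a hypothetical, self-contained version in which directory paths are plain strings rather than Hadoop `Path` objects:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Sketch: before queueing sub-directories for the next listing round,
   // drop the metafolder (".hoodie") so it is never listed.
   class SubdirFilterSketch {
     static final String METAFOLDER_NAME = ".hoodie";

     static List<String> subdirsToList(List<String> subdirs) {
       List<String> result = new ArrayList<>();
       for (String dir : subdirs) {
         // Extract the last path component and skip the metafolder
         String name = dir.substring(dir.lastIndexOf('/') + 1);
         if (!name.equals(METAFOLDER_NAME)) {
           result.add(dir);
         }
       }
       return result;
     }
   }
   ```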

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();

Review comment:
       Maybe we can name it "partitionsToBootstrap". 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##########
@@ -36,4 +49,9 @@
   public String getFileExtension() {
     return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
       do you think we should move this to FSUtils? 
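
   A utility-class version of the check might look like this — a hedged sketch operating on file names rather than Hadoop `Path`s, with an illustrative (not exhaustive) extension list:

   ```java
   // Sketch of an extension-based base-file check that could live in a
   // utility class such as FSUtils instead of on the HoodieFileFormat enum.
   class FileFormatUtilSketch {
     // Illustrative base-file extensions; the real list comes from HoodieFileFormat
     private static final String[] BASE_FILE_EXTENSIONS = {".parquet", ".orc"};

     static boolean isBaseFile(String fileName) {
       int dot = fileName.lastIndexOf('.');
       if (dot < 0) {
         return false;
       }
       String extension = fileName.substring(dot);
       for (String ext : BASE_FILE_EXTENSIONS) {
         if (ext.equals(extension)) {
           return true;
         }
       }
       return false;
     }
   }
   ```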

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       Recently we added some abstractions to Hudi (HoodieData, HoodieJavaRdd, HoodieList). Can we re-use them to avoid duplication across Flink and Spark? 

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       We can have only one commit() method that deals with HoodieData and abstracts away the engine-specific details. 
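
   The abstraction being referred to could be sketched as below — a minimal, hypothetical `EngineData` interface with only the in-memory (list-backed) variant shown; a Spark implementation would wrap a `JavaRDD` the same way:

   ```java
   import java.util.List;
   import java.util.function.Function;
   import java.util.stream.Collectors;

   // Minimal engine-agnostic collection, so one commit() signature can
   // accept either an in-memory list (Flink/Java) or an RDD (Spark).
   interface EngineData<T> {
     <R> EngineData<R> map(Function<T, R> fn);

     List<T> collectAsList();
   }

   // List-backed implementation; a Spark variant would delegate to JavaRDD.
   class ListData<T> implements EngineData<T> {
     private final List<T> data;

     ListData(List<T> data) {
       this.data = data;
     }

     @Override
     public <R> EngineData<R> map(Function<T, R> fn) {
       return new ListData<>(data.stream().map(fn).collect(Collectors.toList()));
     }

     @Override
     public List<T> collectAsList() {
       return data;
     }
   }
   ```

   A shared `commit(EngineData<HoodieRecord> records, ...)` could then tag record locations via `map` without caring which engine backs the data.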







[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r740454888



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -645,4 +612,83 @@ protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime
     // metadata table.
     writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * Commit the {@code HoodieRecord}s to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<HoodieRecord> records, String partitionName, String instantTime);
+
+  /**
+   * Commit the partition to file listing information to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<DirectoryInfo> dirInfoList, String createInstantTime);
+
+
+  /**
+   * A class which represents a directory and the files and directories inside it.
+   *
+   * A {@code PartitionFileInfo} object saves the name of the partition and various properties requires of each file
+   * required for bootstrapping the metadata table. Saving limited properties reduces the total memory footprint when
+   * a very large number of files are present in the dataset being bootstrapped.
+   */
+  public static class DirectoryInfo implements Serializable {
+    // Relative path of the directory (relative to the base directory)
+    private String relativePath;
+    // List of filenames within this partition
+    private List<String> filenames;
+    // Length of the various files
+    private List<Long> filelengths;
+    // List of directories within this partition
+    private List<Path> subdirs = new ArrayList<>();
+    // Is this a HUDI partition
+    private boolean isPartition = false;
+
+    public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
+      this.relativePath = relativePath;
+
+      // Pre-allocate with the maximum length possible
+      filenames = new ArrayList<>(fileStatus.length);
+      filelengths = new ArrayList<>(fileStatus.length);
+
+      for (FileStatus status : fileStatus) {
+        if (status.isDirectory()) {
+          this.subdirs.add(status.getPath());
+        } else if (status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE)) {
+          // Presence of partition meta file implies this is a HUDI partition
+          this.isPartition = true;
+        } else if (FSUtils.isDataFile(status.getPath())) {
+          // Regular HUDI data file (base file or log file)
+          filenames.add(status.getPath().getName());
+          filelengths.add(status.getLen());
+        }
+      }
+    }
+
+    public String getRelativePath() {
+      return relativePath;
+    }
+
+    public int getTotalFiles() {
+      return filenames.size();
+    }
+
+    public boolean isPartition() {
+      return isPartition;
+    }
+
+    public List<Path> getSubdirs() {
+      return subdirs;
+    }
+
+    // Returns a map of filenames mapped to their lengths
+    public Map<String, Long> getFileMap() {

Review comment:
       if the caller is always going to be interested in, map of file name to length, can we populate the map directly in the constructor of DirectoryInfo() and not have two separate lists only. 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       Also, I see this in master before this patch 
   ```
   if (p.getRight().length > filesInDir.size()) {
             String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
             // deal with Non-partition table, we should exclude .hoodie
             partitionToFileStatus.put(partitionName, filesInDir.stream()
                 .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
           }
   ```
   may I know how we are handling the same in this patch? 

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {

Review comment:
       minor: can we name this `processedDirectories`

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       May I know where we are ignoring the meta paths? e.g. L460 before this patch.

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();

Review comment:
       Maybe we can name it "partitionsToBootstrap".

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##########
@@ -36,4 +49,9 @@
   public String getFileExtension() {
     return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
       do you think we should move this to FSUtils? 
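For context, a minimal standalone sketch of the `isBaseFile` helper under discussion (the real method takes an `org.apache.hadoop.fs.Path`; a plain file name is used here to stay self-contained, and the extension list mirroring `HoodieFileFormat`'s base file formats is an illustrative assumption):

```java
// Hedged sketch: a file is considered a base file when its extension
// matches one of the known base-file formats. Names and the extension
// list are assumptions for illustration.
public class BaseFileCheck {
  private static final String[] BASE_FILE_EXTENSIONS = {".parquet", ".orc", ".hfile"};

  public static boolean isBaseFile(String fileName) {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0) {
      return false; // no extension, cannot be a base file
    }
    String extension = fileName.substring(dot);
    for (String ext : BASE_FILE_EXTENSIONS) {
      if (ext.equals(extension)) {
        return true;
      }
    }
    return false;
  }
}
```

Moving it to FSUtils would keep HoodieFileFormat a pure enum of formats while grouping path-inspection helpers in one place.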

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       Recently we added some abstractions to Hudi (HoodieData, HoodieJavaRDD, HoodieList). Can we reuse them to avoid duplication across Flink and Spark?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       We could have a single commit() method that deals with HoodieData and abstracts away the details.
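For context on the prepRecords() tagging quoted above: each record key is mapped deterministically to one of the fixed metadata file groups. A standalone sketch of that mapping (the real logic lives in HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex; the plain hashCode-based hashing below is an illustrative assumption):

```java
// Hedged sketch: hash a record key to a stable file group index in
// [0, numFileGroups). The actual hash function used by Hudi may differ;
// this is only meant to show the shape of the mapping.
public class FileGroupMapper {
  public static int mapRecordKeyToFileGroupIndex(String recordKey, int numFileGroups) {
    // floorMod keeps the index non-negative even for negative hash codes
    return Math.floorMod(recordKey.hashCode(), numFileGroups);
  }
}
```

Because the mapping is deterministic, updates for the same record key always land in the same file group, which is what allows records to be tagged with a file slice location before the write.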




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744094960



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       Did not find a way to use HoodieData etc. here. Open to suggestions.
   










[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963956382


   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963783295


   @prashantwason: let me know once you have addressed all comments; I can then spend time getting the abstraction updated.





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-952771971


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-961588679


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] prashantwason commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963951402


   @nsivabalan I have addressed all comments and rebased off master. You can take it over and work on the abstractions now. Thanks.








[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r748692780



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/data/HoodieList.java
##########
@@ -132,6 +132,14 @@ public long count() {
     return HoodieList.of(new ArrayList<>(new HashSet<>(listData)));
   }
 
+  @Override
+  public HoodieData<T> union(HoodieData<T> other) {
+    List<T> unionResult = new ArrayList<>();

Review comment:
       Small perf improvement: initialize the list's capacity to the sum of the two list sizes to avoid intermediate resizing.
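The suggestion above can be sketched as follows (a minimal standalone version; HoodieList's actual union wraps the result back into a HoodieData, so class and method names here are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the suggested micro-optimization: pre-size the result
// list to the combined size of both inputs so addAll() never has to grow
// and copy the backing array.
public class ListUnion {
  public static <T> List<T> union(List<T> a, List<T> b) {
    List<T> unionResult = new ArrayList<>(a.size() + b.size()); // one allocation
    unionResult.addAll(a);
    unionResult.addAll(b);
    return unionResult;
  }
}
```

For metadata-sized lists (millions of records during bootstrap), avoiding the repeated array doubling inside ArrayList is a cheap win.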







[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-964335081


   ## CI report:
   
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   * 9acc548e3336720726d63478f6acb0df0b6bfca8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3251) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>














[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r745810694



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -528,7 +505,7 @@ private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, Metadata
   private <T> void processAndCommit(String instantTime, ConvertMetadataFunction convertMetadataFunction, boolean canTriggerTableService) {
     if (enabled && metadata != null) {
       List<HoodieRecord> records = convertMetadataFunction.convertMetadata();
-      commit(records, MetadataPartitionType.FILES.partitionPath(), instantTime, canTriggerTableService);
+      commit(engineContext.parallelize(records, 1), MetadataPartitionType.FILES.partitionPath(), instantTime, canTriggerTableService);

Review comment:
       Not sure we need to parallelize with records.size(); for regular delta commits, a parallelism of 1 should be good enough. The bootstrap code path is different.







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r748600093



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -528,7 +505,7 @@ private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, Metadata
   private <T> void processAndCommit(String instantTime, ConvertMetadataFunction convertMetadataFunction, boolean canTriggerTableService) {
     if (enabled && metadata != null) {
       List<HoodieRecord> records = convertMetadataFunction.convertMetadata();
-      commit(records, MetadataPartitionType.FILES.partitionPath(), instantTime, canTriggerTableService);
+      commit(engineContext.parallelize(records, 1), MetadataPartitionType.FILES.partitionPath(), instantTime, canTriggerTableService);

Review comment:
       Yep, that seems right.







[GitHub] [hudi] hudi-bot edited a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-952771971


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-962413253


   ## CI report:
   
   * 139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890) 
   * 84a810e413f22effad661b0955f57e682d8960db UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744140634



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       I can work on it. Once you are done with your changes, do let me know; I will work on the HoodieData abstractions and update the patch, and then you can review my changes.







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r743232434



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -645,4 +612,83 @@ protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime
     // metadata table.
     writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * Commit the {@code HoodieRecord}s to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<HoodieRecord> records, String partitionName, String instantTime);
+
+  /**
+   * Commit the partition to file listing information to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List<DirectoryInfo> dirInfoList, String createInstantTime);
+
+
+  /**
+   * A class which represents a directory and the files and directories inside it.
+   *
+   * A {@code PartitionFileInfo} object saves the name of the partition and various properties requires of each file
+   * required for bootstrapping the metadata table. Saving limited properties reduces the total memory footprint when
+   * a very large number of files are present in the dataset being bootstrapped.
+   */
+  public static class DirectoryInfo implements Serializable {
+    // Relative path of the directory (relative to the base directory)
+    private String relativePath;
+    // List of filenames within this partition
+    private List<String> filenames;
+    // Length of the various files
+    private List<Long> filelengths;
+    // List of directories within this partition
+    private List<Path> subdirs = new ArrayList<>();
+    // Is this a HUDI partition
+    private boolean isPartition = false;
+
+    public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
+      this.relativePath = relativePath;
+
+      // Pre-allocate with the maximum length possible
+      filenames = new ArrayList<>(fileStatus.length);
+      filelengths = new ArrayList<>(fileStatus.length);
+
+      for (FileStatus status : fileStatus) {
+        if (status.isDirectory()) {
+          this.subdirs.add(status.getPath());
+        } else if (status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE)) {
+          // Presence of partition meta file implies this is a HUDI partition
+          this.isPartition = true;
+        } else if (FSUtils.isDataFile(status.getPath())) {
+          // Regular HUDI data file (base file or log file)
+          filenames.add(status.getPath().getName());
+          filelengths.add(status.getLen());
+        }
+      }
+    }
+
+    public String getRelativePath() {
+      return relativePath;
+    }
+
+    public int getTotalFiles() {
+      return filenames.size();
+    }
+
+    public boolean isPartition() {
+      return isPartition;
+    }
+
+    public List<Path> getSubdirs() {
+      return subdirs;
+    }
+
+    // Returns a map of filenames mapped to their lengths
+    public Map<String, Long> getFileMap() {

Review comment:
       Done. A good simplification indeed.
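       The simplification discussed above (exposing a single filename-to-length map instead of the two parallel lists) can be sketched roughly as below. Class and method names here are illustrative, not the exact Hudi implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirectoryInfoSketch {
    // Parallel lists keep the per-file memory footprint small:
    // only the filename and its length are retained, not a full FileStatus.
    private final List<String> filenames = new ArrayList<>();
    private final List<Long> filelengths = new ArrayList<>();

    public void addFile(String name, long len) {
        filenames.add(name);
        filelengths.add(len);
    }

    // Zips the two parallel lists into a single map, so callers never
    // see the internal representation.
    public Map<String, Long> getFileNameToSizeMap() {
        Map<String, Long> map = new HashMap<>(filenames.size());
        for (int i = 0; i < filenames.size(); i++) {
            map.put(filenames.get(i), filelengths.get(i));
        }
        return map;
    }
}
```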







[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-962419877


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890",
       "triggerID" : "139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "84a810e413f22effad661b0955f57e682d8960db",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180",
       "triggerID" : "84a810e413f22effad661b0955f57e682d8960db",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072765



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {

Review comment:
       Renamed.







[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744140450



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       sure.
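       For reference, the listing loop in the diff above processes the pending directory queue in bounded rounds — at most fileListingParallelism directories per round — and appends discovered sub-directories back onto the queue for a later round. A single-threaded sketch of that control flow, using java.nio instead of the Hadoop FileSystem API (purely illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BoundedListingSketch {
    // Lists all directories under basePath, visiting at most `parallelism`
    // directories per round, mirroring the queue handling in the PR.
    public static List<Path> listAllDirs(Path basePath, int parallelism) throws IOException {
        List<Path> pathsToList = new LinkedList<>();
        pathsToList.add(basePath);
        List<Path> found = new ArrayList<>();

        while (!pathsToList.isEmpty()) {
            // In each round we list only a section of the queue.
            int numDirsToList = Math.min(parallelism, pathsToList.size());
            List<Path> round = new ArrayList<>(pathsToList.subList(0, numDirsToList));
            pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));

            for (Path dir : round) {
                found.add(dir);
                try (Stream<Path> children = Files.list(dir)) {
                    // Sub-directories go back on the queue for a later round.
                    pathsToList.addAll(
                        children.filter(Files::isDirectory).collect(Collectors.toList()));
                }
            }
        }
        return found;
    }
}
```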







[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r747057928



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -147,15 +154,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime, boolean canTriggerTableService) {
+    ValidationUtils.checkState(!partitionInfoList.isEmpty());

Review comment:
       have fixed it







[GitHub] [hudi] hudi-bot removed a comment on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-952771971









[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-964004431


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2890",
       "triggerID" : "139369b87cb60e00b7d5f9dbf1db1b6f3bbf3af6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "84a810e413f22effad661b0955f57e682d8960db",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180",
       "triggerID" : "84a810e413f22effad661b0955f57e682d8960db",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fb6efa951068fe303c40e30afbb1eda19175f676",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242",
       "triggerID" : "fb6efa951068fe303c40e30afbb1eda19175f676",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fb6efa951068fe303c40e30afbb1eda19175f676 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3242) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744077361



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -145,15 +152,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       Looking into this.







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072970



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##########
@@ -36,4 +49,9 @@
   public String getFileExtension() {
     return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
       I feel it's closer to the file format, but am open to moving it if you feel otherwise. Also, the isLogFile check is in FSUtils since it uses a regex instead of the file extension.
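       The distinction being discussed — an extension check for base files versus a regex for log files — might look roughly like the sketch below. The extensions and the pattern here are simplified stand-ins, not the actual Hudi constants:

```java
import java.util.regex.Pattern;

public class FileKindSketch {
    // Base files are identified purely by their extension, so the check
    // fits naturally next to the file-format definitions.
    public static boolean isBaseFile(String fileName) {
        return fileName.endsWith(".parquet") || fileName.endsWith(".orc");
    }

    // Log files follow a multi-part naming pattern (hidden file, ".log.",
    // a version number, optional write token), hence a regex.
    private static final Pattern LOG_FILE_PATTERN = Pattern.compile("\\..*\\.log\\.\\d+");

    public static boolean isLogFile(String fileName) {
        return LOG_FILE_PATTERN.matcher(fileName).find();
    }
}
```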







[GitHub] [hudi] nsivabalan commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r745808254



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -647,4 +624,96 @@ protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime
     // metadata table.
     writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * This is invoked to bootstrap metadata table for a dataset. Bootstrap Commit has special handling mechanism due to its scale compared to
+   * other regular commits.
+   *
+   */
+  protected void bootstrapCommit(List<DirectoryInfo> partitionInfoList, String createInstantTime) {

Review comment:
       Note to reviewer: I have added HoodieData abstractions to commit() in the metadata writer, so the bootstrap path is now a single code path shared across all engines. I see some differences in the actual commit() impl between Spark and Flink, so I did not try to generalize that in this patch.

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -147,15 +154,39 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
    *
    * The record is tagged with respective file slice's location based on its record key.
    */
-  private JavaRDD<HoodieRecord> prepRecords(List<HoodieRecord> records, String partitionName, int numFileGroups) {
+  private JavaRDD<HoodieRecord> prepRecords(JavaRDD<HoodieRecord> recordsRDD, String partitionName, int numFileGroups) {
     List<FileSlice> fileSlices = HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient, partitionName);
     ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, String.format("Invalid number of file groups: found=%d, required=%d", fileSlices.size(), numFileGroups));
 
-    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
-    return jsc.parallelize(records, 1).map(r -> {
+    return recordsRDD.map(r -> {
       FileSlice slice = fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), numFileGroups));
       r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
       return r;
     });
   }
+
+  @Override
+  protected void commit(List<DirectoryInfo> partitionInfoList, String createInstantTime, boolean canTriggerTableService) {
+    ValidationUtils.checkState(!partitionInfoList.isEmpty());
+
+    JavaSparkContext jsc = ((HoodieSparkEngineContext) engineContext).getJavaSparkContext();
+    List<String> partitions = partitionInfoList.stream().map(p -> p.getRelativePath()).collect(Collectors.toList());
+    final int totalFiles = partitionInfoList.stream().mapToInt(p -> p.getTotalFiles()).sum();
+
+    // Record which saves the list of all partitions
+    HoodieRecord record = HoodieMetadataPayload.createPartitionListRecord(partitions);
+    JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record), 1);
+    if (!partitionInfoList.isEmpty()) {
+      JavaRDD<HoodieRecord> fileListRecords = jsc.parallelize(partitionInfoList, partitionInfoList.size()).map(pinfo -> {
+        // Record which saves files within a partition
+        return HoodieMetadataPayload.createPartitionFilesRecord(
+            pinfo.getRelativePath(), Option.of(pinfo.getFileMap()), Option.empty());
+      });
+      recordRDD = recordRDD.union(fileListRecords);
+    }
+
+    LOG.info("Committing " + partitions.size() + " partitions and " + totalFiles + " files to metadata");
+    ValidationUtils.checkState(recordRDD.count() == (partitions.size() + 1));

Review comment:
       Won't we be triggering an action on the recordRDD once here (count) and then again when we run the follow-up actions? Wondering if we really need the size-check validation here. It is not that costly, since generating partitionInfoList is the expensive part, but just wanted to check.
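       For context on the size check being questioned: the bootstrap commit in the diff above produces exactly one partition-list record plus one file-listing record per partition, so the expected total is partitions + 1. A minimal sketch of that invariant (names are illustrative):

```java
import java.util.List;

public class BootstrapRecordCountSketch {
    // One record listing all partitions, plus one record per partition
    // holding that partition's file name -> length map.
    public static long expectedRecordCount(List<String> partitions) {
        return partitions.size() + 1L;
    }
}
```

Whether asserting this with an extra RDD action is worth the recomputation is exactly the trade-off raised above: the count itself is cheap relative to the listing, but it does trigger the lineage once more unless the RDD is persisted.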







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072738



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
 
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0, numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
 
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
 
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
       The DirectoryInfo constructor parses the FileStatus[] array and constructs:
   1. A list of sub-directories
   2. Whether the directory is a partition (based on the presence of the partition meta file)
   
   So in the code above, dirInfo.getSubdirs() only returns the sub-directories.
   
   The DirectoryInfo constructor was not ignoring the .hoodie directory; I will add code for that. The .hoodie directory and its sub-dirs will still be listed (sub-optimal), but none of them will be treated as partitions due to the lack of partition meta files. I will update the code.
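For context, a minimal sketch of what such a DirectoryInfo cache might look like, with a simple stand-in for Hadoop's FileStatus (all class names, field names, and constants here are illustrative, not the actual Hudi implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirectoryInfoSketch {
    /** Minimal stand-in for Hadoop's FileStatus (illustrative only). */
    static class Entry {
        final String name; final boolean isDir; final long len;
        Entry(String name, boolean isDir, long len) { this.name = name; this.isDir = isDir; this.len = len; }
    }

    static final String PARTITION_META_FILE = ".hoodie_partition_metadata";
    static final String METAFOLDER = ".hoodie";

    final String relativePath;
    final Map<String, Long> fileNameToLen = new HashMap<>(); // only name + length, not full FileStatus
    final List<String> subdirs = new ArrayList<>();
    boolean isPartition = false;

    DirectoryInfoSketch(String relativePath, List<Entry> listing) {
        this.relativePath = relativePath;
        for (Entry e : listing) {
            if (e.isDir) {
                if (!e.name.equals(METAFOLDER)) { // skip .hoodie, per the review comment
                    subdirs.add(relativePath.isEmpty() ? e.name : relativePath + "/" + e.name);
                }
            } else if (e.name.equals(PARTITION_META_FILE)) {
                isPartition = true;               // presence of the meta file marks a partition
            } else {
                fileNameToLen.put(e.name, e.len); // cache only what the metadata record needs
            }
        }
    }

    public static void main(String[] args) {
        DirectoryInfoSketch d = new DirectoryInfoSketch("2021/10", Arrays.asList(
            new Entry(".hoodie_partition_metadata", false, 0),
            new Entry("f1.parquet", false, 1024),
            new Entry("sub", true, 0)));
        System.out.println(d.isPartition + " " + d.subdirs + " " + d.fileNameToLen);
        // prints: true [2021/10/sub] {f1.parquet=1024}
    }
}
```

Keeping only the file name and length (rather than the full FileStatus) is what drives the memory-overhead reduction described in the PR summary.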







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072865



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();

Review comment:
       Renamed







[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r745395027



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##########
@@ -36,4 +49,9 @@
   public String getFileExtension() {
     return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
       Moved to FSUtils.







[GitHub] [hudi] hudi-bot commented on pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#issuecomment-963954031


   ## CI report:
   
   * 84a810e413f22effad661b0955f57e682d8960db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3180) 
   * fb6efa951068fe303c40e30afbb1eda19175f676 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

