Posted to commits@hudi.apache.org by "prashantwason (via GitHub)" <gi...@apache.org> on 2023/04/20 23:36:44 UTC

[GitHub] [hudi] prashantwason opened a new pull request, #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

prashantwason opened a new pull request, #8527:
URL: https://github.com/apache/hudi/pull/8527

   [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.
   
   ### Change Logs
   
   File group creation is parallelized using engineContext.foreach.
   Files left over in the MDT partition from a previous attempt are deleted before creation.
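
A minimal sketch of the two steps above (clean the partition, then create the file groups in parallel), written against the local filesystem with `java.nio.file` and a parallel stream. This is not the code from the PR: the parallel stream is a stand-in for `engineContext.foreach`, and the class name, method names, and the `fg-NNNN_<instantTime>.log` file naming are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical illustration; not part of Hudi.
class FileGroupInitSketch {

  // Deletes the partition directory and everything under it, if it exists.
  static void deletePartitionIfPresent(Path partitionDir) throws IOException {
    if (!Files.exists(partitionDir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(partitionDir)) {
      // Reverse order so that children are deleted before their parents.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
    if (Files.exists(partitionDir)) {
      throw new IllegalStateException("Failed to delete partition " + partitionDir);
    }
  }

  // Removes leftovers, then creates one empty file per file group in parallel.
  static List<Path> initializeFileGroups(Path partitionDir, String instantTime, int fileGroupCount)
      throws IOException {
    deletePartitionIfPresent(partitionDir);
    Files.createDirectories(partitionDir);
    return IntStream.range(0, fileGroupCount)
        .parallel() // stand-in for the engine context's parallel foreach
        .mapToObj(i -> {
          // Assumed naming scheme, for illustration only.
          Path file = partitionDir.resolve(String.format("fg-%04d_%s.log", i, instantTime));
          try {
            return Files.createFile(file);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    Path partition = Files.createTempDirectory("mdt-sketch").resolve("files");
    List<Path> created = initializeFileGroups(partition, "20230420233644", 4);
    System.out.println("created " + created.size() + " file groups in " + partition);
  }
}
```

Deleting before creating means every file in the partition after a successful run carries the same instant time, which is the property the Impact section relies on.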
   
   ### Impact
   
   Faster file group creation when an MDT partition has a large number of file groups.
   Fixes an issue where a previously failed initialization could leave behind partially or wholly written log files with a different instant time.
   
   ### Risk level (write none, low, medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8527:
URL: https://github.com/apache/hudi/pull/8527#issuecomment-1517080242

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan merged pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan merged PR #8527:
URL: https://github.com/apache/hudi/pull/8527




[GitHub] [hudi] hudi-bot commented on pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8527:
URL: https://github.com/apache/hudi/pull/8527#issuecomment-1517085510

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16512",
       "triggerID" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16512) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8527:
URL: https://github.com/apache/hudi/pull/8527#issuecomment-1517585755

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16512",
       "triggerID" : "346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 346d5c8d8d5d370fb7f57fcf47b040fdc4f51a9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16512) 
   




[GitHub] [hudi] nsivabalan commented on a diff in pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8527:
URL: https://github.com/apache/hudi/pull/8527#discussion_r1181247986


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -731,21 +733,40 @@ public void initializeMetadataPartitions(HoodieTableMetaClient dataMetaClient, L
    */
   private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, MetadataPartitionType metadataPartition, String instantTime,
                                     int fileGroupCount) throws IOException {
-    final HashMap<HeaderMetadataType, String> blockHeader = new HashMap<>();
-    blockHeader.put(HeaderMetadataType.INSTANT_TIME, instantTime);
+    // Remove all existing file groups or leftover files in the partition
+    final Path partitionPath = new Path(metadataWriteConfig.getBasePath(), metadataPartition.getPartitionPath());
+    FileSystem fs = metadataMetaClient.getFs();
+    try {
+      final FileStatus[] existingFiles = fs.listStatus(partitionPath);
+      if (existingFiles.length > 0) {
+        LOG.warn("Deleting all existing files found in MDT partition " + metadataPartition.getPartitionPath());
+        fs.delete(partitionPath, true);
+        ValidationUtils.checkState(!fs.exists(partitionPath), "Failed to delete MDT partition " + metadataPartition);
+      }
+    } catch (FileNotFoundException e) {

Review Comment:
   If the deletion fails, will the exception propagate all the way up and fail the ingestion thread?


