You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by "nsivabalan (via GitHub)" <gi...@apache.org> on 2023/02/03 09:39:39 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize a metadata for a new data table

nsivabalan opened a new pull request, #7841:
URL: https://github.com/apache/hudi/pull/7841

   ### Change Logs
   
   Looks like we are parsing all partitions and files from file system to initialize metadata table for a new data table. Infact, at the end of it, we discard all of them since they are not committed. This patch is optimizing the code block of interest. 
   This impact CTAS specifically. Also fixed some of the job description (spark) along the way. 
   
   Here is the snapshot for a simple CTAS command w/ and w/o the fix. 
   
   <img width="1778" alt="Screen Shot 2023-02-03 at 1 36 55 AM" src="https://user-images.githubusercontent.com/513218/216566324-b88f3f46-723e-4712-b62d-f4ac9dacdb98.png">
   
   w/ the fix
   <img width="1775" alt="Screen Shot 2023-02-03 at 1 13 26 AM" src="https://user-images.githubusercontent.com/513218/216566376-e63374dd-1666-44b5-874f-89221e6dcb92.png">
   
   ### Impact
   
   Improves perf of initialization of metadata table for a new table.
   
   ### Risk level (write none, low medium or high below)
   
   low. 
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416595925

   CI is green. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1415523535

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416433656

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     }, {
       "hash" : "984977457df724c46840f44b6c6d12a873ef7496",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "984977457df724c46840f44b6c6d12a873ef7496",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897) 
   * 984977457df724c46840f44b6c6d12a873ef7496 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416520669

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     }, {
       "hash" : "984977457df724c46840f44b6c6d12a873ef7496",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909",
       "triggerID" : "984977457df724c46840f44b6c6d12a873ef7496",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6744aaeb71255fe112c6b92fb387c52257219bcb",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14910",
       "triggerID" : "6744aaeb71255fe112c6b92fb387c52257219bcb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 984977457df724c46840f44b6c6d12a873ef7496 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909) 
   * 6744aaeb71255fe112c6b92fb387c52257219bcb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14910) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1415814259

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] vinothchandar commented on a diff in pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.

vinothchandar commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1096047641


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -280,6 +280,7 @@ protected void commit(HoodieTable table, String commitActionType, String instant
     }
     // update Metadata table
     writeTableMetadata(table, instantTime, commitActionType, metadata);
+    context.setJobStatus(this.getClass().getSimpleName(),"Completing commit in table " + config.getTableName());

Review Comment:
   can we also add the instantTime. So we will know what completed



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##########
@@ -133,6 +133,7 @@ protected void commit(String instantTime, Map<MetadataPartitionType, HoodieData<
     HoodieData<HoodieRecord> preppedRecords = prepRecords(partitionRecordsMap);
     JavaRDD<HoodieRecord> preppedRecordRDD = HoodieJavaRDD.getJavaRDD(preppedRecords);
 
+    engineContext.setJobStatus(this.getClass().getName(), "Committing to metadata table " + metadataWriteConfig.getTableName());

Review Comment:
   add instantTime



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,44 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip parsing file system for metadata records if its first commit for the table.

Review Comment:
   this comment is not clear. whats "parsing" and file system mean? is it listing the file system? is that what you want to say



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,44 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip parsing file system for metadata records if its first commit for the table.
+    if (dataMetaClient.getActiveTimeline().getWriteTimeline().countInstants() > 1
+        || dataMetaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().countInstants() != 0) {
+      List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);

Review Comment:
   +1 is there a cleaner way of checking this? this feels a bit obtuse to derive that this is the first commit into the MDT.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7841: [HUDI-5694] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.

Zouxxyy commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1257264487


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -519,6 +519,8 @@ public void preWrite(String instantTime, WriteOperationType writeOperationType,
    */
   protected void postCommit(HoodieTable table, HoodieCommitMetadata metadata, String instantTime, Option<Map<String, String>> extraMetadata) {
     try {
+      context.setJobStatus(this.getClass().getSimpleName(),"Cleaning up marker directories for commit " + instantTime + " in table "
+          + config.getTableName());

Review Comment:
   why need this, in the after `quietDeleteMarkerDir` set it again



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416376130

   @vinothchandar : I have fixed the screen shots w/ red block to show the unnecessary computation w/o the fix. 
   wrt perf improvement, above screenshots are for a small table. But still we do see 4.5 vs 2s (after the fix) for metadata table instantiation + 1st commit. 
   this might be costly for a large table. lets say you have 10000+ files in the first commit.  we avoid listing all files, fetching the commit time from it and filtering it out. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7841: [HUDI-5694] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.

Zouxxyy commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1257264487


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -519,6 +519,8 @@ public void preWrite(String instantTime, WriteOperationType writeOperationType,
    */
   protected void postCommit(HoodieTable table, HoodieCommitMetadata metadata, String instantTime, Option<Map<String, String>> extraMetadata) {
     try {
+      context.setJobStatus(this.getClass().getSimpleName(),"Cleaning up marker directories for commit " + instantTime + " in table "
+          + config.getTableName());

Review Comment:
   why need this, in the after `quietDeleteMarkerDir` set it again



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a diff in pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1096296545


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,45 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip file system listing to populate metadata records if its a fresh table.
+    // this is applicable only if the table already has N commits and metadata is enabled at a later point in time.
+    if (!createInstantTime.equals(SOLO_COMMIT_TIMESTAMP)) { // SOLO_COMMIT_TIMESTAMP will be the intial commit time in MDT for a fresh table.

Review Comment:
   yes, no additional changes. fixed it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416516646

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     }, {
       "hash" : "984977457df724c46840f44b6c6d12a873ef7496",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909",
       "triggerID" : "984977457df724c46840f44b6c6d12a873ef7496",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6744aaeb71255fe112c6b92fb387c52257219bcb",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6744aaeb71255fe112c6b92fb387c52257219bcb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897) 
   * 984977457df724c46840f44b6c6d12a873ef7496 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909) 
   * 6744aaeb71255fe112c6b92fb387c52257219bcb UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.

alexeykudinkin commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1096285862


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,45 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip file system listing to populate metadata records if its a fresh table.
+    // this is applicable only if the table already has N commits and metadata is enabled at a later point in time.
+    if (!createInstantTime.equals(SOLO_COMMIT_TIMESTAMP)) { // SOLO_COMMIT_TIMESTAMP will be the intial commit time in MDT for a fresh table.

Review Comment:
   You just wrapped this into a conditional, no changes to the branch itself, right?



##########
hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/HoodieDataSourceInternalWriter.java:
##########
@@ -61,13 +60,15 @@ public HoodieDataSourceInternalWriter(String instantTime, HoodieWriteConfig writ
     this.structType = structType;
     this.populateMetaFields = populateMetaFields;
     this.arePartitionRecordsSorted = arePartitionRecordsSorted;
-    this.extraMetadataMap = DataSourceUtils.getExtraMetadata(dataSourceOptions.asMap());
+    this.sparkSession = sparkSession;
+    Map<String, String> extraMetadataMap = DataSourceUtils.getExtraMetadata(dataSourceOptions.asMap());
     this.dataSourceInternalWriterHelper = new DataSourceInternalWriterHelper(instantTime, writeConfig, structType,
         sparkSession, configuration, extraMetadataMap);
   }
 
   @Override
   public DataWriterFactory<InternalRow> createWriterFactory() {
+    sparkSession.sparkContext().setJobGroup(this.getClass().getSimpleName(), "Writing data to files using bulk_insert", true);

Review Comment:
   Let's avoid setting misleading annotations -- this method is just creating the factory but not actually writing any data until later point in time. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -280,6 +280,7 @@ protected void commit(HoodieTable table, String commitActionType, String instant
     }
     // update Metadata table
     writeTableMetadata(table, instantTime, commitActionType, metadata);
+    context.setJobStatus(this.getClass().getSimpleName(),"Completing commit " + instantTime + " in table " + config.getTableName());

Review Comment:
   What do we need this for? There's no Spark job being triggered at that point?



##########
hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/HoodieDataSourceInternalWriter.java:
##########
@@ -51,7 +50,7 @@ public class HoodieDataSourceInternalWriter implements DataSourceWriter {
   private final DataSourceInternalWriterHelper dataSourceInternalWriterHelper;
   private final boolean populateMetaFields;
   private final Boolean arePartitionRecordsSorted;
-  private Map<String, String> extraMetadataMap = new HashMap<>();
+  private final SparkSession sparkSession;

Review Comment:
   Writer will be passed on to executor, you can't add a Spark Session in here



##########
hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/HoodieDataSourceInternalWriter.java:
##########
@@ -89,6 +90,7 @@ public void onDataWriterCommit(WriterCommitMessage message) {
 
   @Override
   public void commit(WriterCommitMessage[] messages) {
+    sparkSession.sparkContext().setJobGroup(this.getClass().getSimpleName(), "Committing to data table", true);

Review Comment:
   Same comment: what's this annotating? There's no Spark jobs being executed at this point
   



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,45 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip file system listing to populate metadata records if its a fresh table.
+    // this is applicable only if the table already has N commits and metadata is enabled at a later point in time.
+    if (!createInstantTime.equals(SOLO_COMMIT_TIMESTAMP)) { // SOLO_COMMIT_TIMESTAMP will be the intial commit time in MDT for a fresh table.

Review Comment:
   Let's also invert conditional (to avoid negation, put smaller block first)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1415552283

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] danny0405 commented on a diff in pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1095685996


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,44 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip parsing file system for metadata records if its first commit for the table.
+    if (dataMetaClient.getActiveTimeline().getWriteTimeline().countInstants() > 1
+        || dataMetaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().countInstants() != 0) {
+      List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);

Review Comment:
   > dataMetaClient.getActiveTimeline().getWriteTimeline().countInstants() > 1
   
   How about we have 2 inflight instants on the timeline?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a diff in pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on code in PR #7841:
URL: https://github.com/apache/hudi/pull/7841#discussion_r1096257382


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1086,39 +1086,44 @@ private void initialCommit(String createInstantTime, List<MetadataPartitionType>
 
     Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
 
-    List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);
-    Map<String, Map<String, Long>> partitionToFilesMap = partitionInfoList.stream()
-        .map(p -> {
-          String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-          return Pair.of(partitionName, p.getFileNameToSizeMap());
-        })
-        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-    int totalDataFilesCount = partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-    List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-    if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-      // Record which saves the list of all partitions
-      HoodieRecord allPartitionRecord = HoodieMetadataPayload.createPartitionListRecord(partitions);
-      HoodieData<HoodieRecord> filesPartitionRecords = getFilesPartitionRecords(createInstantTime, partitionInfoList, allPartitionRecord);
-      ValidationUtils.checkState(filesPartitionRecords.count() == (partitions.size() + 1));
-      partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecords);
-    }
+    // skip parsing file system for metadata records if its first commit for the table.
+    if (dataMetaClient.getActiveTimeline().getWriteTimeline().countInstants() > 1
+        || dataMetaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().countInstants() != 0) {
+      List<DirectoryInfo> partitionInfoList = listAllPartitions(dataMetaClient);

Review Comment:
   I have changed the way we deduce the first commit. essentially piggy backing on the initial commit instant time. If its SOLO_COMMIT_TIMESTAMP("00000000000000"), then we know its a fresh table. If not, its an already existing table. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416385583

   Addressed all comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.

hudi-bot commented on PR #7841:
URL: https://github.com/apache/hudi/pull/7841#issuecomment-1416440057

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897",
       "triggerID" : "420a1413361efdcfdc864153b2b9f5dc9ff5d448",
       "triggerType" : "PUSH"
     }, {
       "hash" : "984977457df724c46840f44b6c6d12a873ef7496",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909",
       "triggerID" : "984977457df724c46840f44b6c6d12a873ef7496",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 420a1413361efdcfdc864153b2b9f5dc9ff5d448 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14897) 
   * 984977457df724c46840f44b6c6d12a873ef7496 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14909) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan merged pull request #7841: [HUDI-5496] Avoid unnecessary file system parsing to initialize metadata table for a new data table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan merged PR #7841:
URL: https://github.com/apache/hudi/pull/7841


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org