You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "codope (via GitHub)" <gi...@apache.org> on 2023/04/06 10:40:24 UTC

[GitHub] [hudi] codope opened a new pull request, #8397: [WIP] Fix COW input format for bootstrap tables

codope opened a new pull request, #8397:
URL: https://github.com/apache/hudi/pull/8397

   ### Change Logs
   
   Push down instantiation of table schema resolver to the point when it's actually needed. This helps in avoiding some cycles reading commit metadata unless needed.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8397:
URL: https://github.com/apache/hudi/pull/8397#issuecomment-1502013715

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16229",
       "triggerID" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 63b8f47e8f7bd8f91ec7f1a675967cdd890516b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16229) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on PR #8397:
URL: https://github.com/apache/hudi/pull/8397#issuecomment-1537149743

   @yihua @xiarixiaoyao thanks for reviewing. As mentioned in the comments, the unnecessary instantiation is already removed and we don't need lazy listing in Hive file index impl for now. So, I am going to close the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8397:
URL: https://github.com/apache/hudi/pull/8397#issuecomment-1501811471

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 63b8f47e8f7bd8f91ec7f1a675967cdd890516b4 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope closed pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed pull request #8397: [HUDI-6055] Fix input format for bootstrap tables
URL: https://github.com/apache/hudi/pull/8397


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao commented on code in PR #8397:
URL: https://github.com/apache/hudi/pull/8397#discussion_r1185993071


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -82,32 +81,25 @@ public class SchemaEvolutionContext {
 
   private final InputSplit split;
   private final JobConf job;
-  private HoodieTableMetaClient metaClient;
+  private final HoodieTableMetaClient metaClient;
   public Option<InternalSchema> internalSchemaOption;
 
-  public SchemaEvolutionContext(InputSplit split, JobConf job) throws IOException {
+  public SchemaEvolutionContext(InputSplit split, JobConf job) {
     this(split, job, Option.empty());
   }
 
-  public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) throws IOException {
+  public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) {
     this.split = split;
     this.job = job;
     this.metaClient = metaClientOption.isPresent() ? metaClientOption.get() : setUpHoodieTableMetaClient();

Review Comment:
   now the init of internalSchemaOption has been removed.
   https://github.com/apache/hudi/blob/83d4fe15b4f50a58e919390217b71bb95e952085/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L91  need to modify
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on code in PR #8397:
URL: https://github.com/apache/hudi/pull/8397#discussion_r1185708463


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -159,27 +151,26 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
    * Do schema evolution for ParquetFormat.
    */
   public void doEvolutionForParquetFormat() {
-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
+    if (!disableSchemaEvolution) {
+      if (!internalSchemaOption.isPresent()) {

Review Comment:
   Should this condition be `internalSchemaOption == null` since it may not be initialized?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -159,27 +151,26 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
    * Do schema evolution for ParquetFormat.
    */
   public void doEvolutionForParquetFormat() {
-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
+    if (!disableSchemaEvolution) {
+      if (!internalSchemaOption.isPresent()) {
+        internalSchemaOption = new TableSchemaResolver(metaClient).getTableInternalSchemaFromCommitMetadata();
+      }
       // reading hoodie schema evolution table
       job.setBoolean(HIVE_EVOLUTION_ENABLE, true);
-      Path finalPath = ((FileSplit)split).getPath();
+      Path finalPath = ((FileSplit) split).getPath();
       InternalSchema prunedSchema;
-      List<String> requiredColumns = getRequireColumn(job);
-      // No need trigger schema evolution for count(*)/count(1) operation
-      boolean disableSchemaEvolution =
-          requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
-      if (!disableSchemaEvolution) {
-        prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
-        InternalSchema querySchema = prunedSchema;
-        Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
-        InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
-        InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true,
-            true).mergeSchema();
-        List<Types.Field> fields = mergedInternalSchema.columns();
-        setColumnNameList(job, fields);
-        setColumnTypeList(job, fields);
-        pushDownFilter(job, querySchema, fileSchema);
-      }
+      prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), requiredColumns);
+      InternalSchema querySchema = prunedSchema;
+      Long commitTime = Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
+      InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
+      InternalSchema mergedInternalSchema = new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
+      List<Types.Field> fields = mergedInternalSchema.columns();
+      setColumnNameList(job, fields);
+      setColumnTypeList(job, fields);
+      pushDownFilter(job, querySchema, fileSchema);

Review Comment:
   and should this part be guarded by `!internalSchemaOption.isPresent()`?



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -314,9 +314,7 @@ protected HoodieTimeline getActiveTimeline() {
   private Object[] parsePartitionColumnValues(String[] partitionColumns, String partitionPath) {
     Object[] partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath);
     if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) {
-      throw new HoodieException("Failed to parse partition column values from the partition-path:"
-          + " likely non-encoded slashes being used in partition column's values. You can try to"
-          + " work this around by switching listing mode to eager");
+      LOG.warn(">>> PartitionColumns: " + partitionColumns + "  PartitionValues: " + partitionColumnValues);

Review Comment:
   So I assume we still need to fail here instead of printing the warning and letting it return?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -159,27 +151,26 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
    * Do schema evolution for ParquetFormat.
    */
   public void doEvolutionForParquetFormat() {
-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());

Review Comment:
   I see.  This is existing logic.  Still wondering the same question.



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -159,27 +151,26 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
    * Do schema evolution for ParquetFormat.
    */
   public void doEvolutionForParquetFormat() {
-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());

Review Comment:
   To clarify, do `requiredColumns` contain the columns from the predicate(s), e.g., `count(*) where col1 is not null`?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -159,27 +151,26 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
    * Do schema evolution for ParquetFormat.
    */
   public void doEvolutionForParquetFormat() {
-    if (internalSchemaOption.isPresent()) {
+    List<String> requiredColumns = getRequireColumn(job);
+    // No need trigger schema evolution for count(*)/count(1) operation
+    boolean disableSchemaEvolution = requiredColumns.isEmpty() || (requiredColumns.size() == 1 && requiredColumns.get(0).isEmpty());
+    if (!disableSchemaEvolution) {
+      if (!internalSchemaOption.isPresent()) {
+        internalSchemaOption = new TableSchemaResolver(metaClient).getTableInternalSchemaFromCommitMetadata();

Review Comment:
   Still do try .. catch .. here in case the internal schema cannot be read?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java:
##########
@@ -58,7 +58,7 @@ public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
         shouldIncludePendingCommits,
         true,
         new NoopCache(),
-        false);
+        true);

Review Comment:
   Now I remember we need to fix the lazy listing for Hvie File Index.  Should this be in a separate PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8397:
URL: https://github.com/apache/hudi/pull/8397#issuecomment-1501817799

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16229",
       "triggerID" : "63b8f47e8f7bd8f91ec7f1a675967cdd890516b4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 63b8f47e8f7bd8f91ec7f1a675967cdd890516b4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16229) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on a diff in pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #8397:
URL: https://github.com/apache/hudi/pull/8397#discussion_r1186702216


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java:
##########
@@ -82,32 +81,25 @@ public class SchemaEvolutionContext {
 
   private final InputSplit split;
   private final JobConf job;
-  private HoodieTableMetaClient metaClient;
+  private final HoodieTableMetaClient metaClient;
   public Option<InternalSchema> internalSchemaOption;
 
-  public SchemaEvolutionContext(InputSplit split, JobConf job) throws IOException {
+  public SchemaEvolutionContext(InputSplit split, JobConf job) {
     this(split, job, Option.empty());
   }
 
-  public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) throws IOException {
+  public SchemaEvolutionContext(InputSplit split, JobConf job, Option<HoodieTableMetaClient> metaClientOption) {
     this.split = split;
     this.job = job;
     this.metaClient = metaClientOption.isPresent() ? metaClientOption.get() : setUpHoodieTableMetaClient();

Review Comment:
   oh ok then there is no need of this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on a diff in pull request #8397: [HUDI-6055] Fix input format for bootstrap tables

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #8397:
URL: https://github.com/apache/hudi/pull/8397#discussion_r1186702204


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java:
##########
@@ -58,7 +58,7 @@ public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
         shouldIncludePendingCommits,
         true,
         new NoopCache(),
-        false);
+        true);

Review Comment:
   Yes, this should not have been part of this PR. Actually, it doesn't really matter for Hudi connector as it doesn't go through COW input format code. And for Hive connector, we already saw that it was partition loader will instantiate this every call. However, the actual perf issue for Hive connector was fixed due to https://github.com/apache/hudi/pull/7527#discussion_r1054022432
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org