You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "voonhous (via GitHub)" <gi...@apache.org> on 2023/04/19 09:44:56 UTC

[GitHub] [hudi] voonhous opened a new pull request, #8501: [HUDI-6103] Validate required columns when fetching required positions

voonhous opened a new pull request, #8501:
URL: https://github.com/apache/hudi/pull/8501

   ### Change Logs
   
   Throw an error if a required field is not found in Hudi table schema. 
   
   Please refer to [HUDI-6103](https://issues.apache.org/jira/browse/HUDI-6103) for more details
   
   ### Impact
   
   A more informative error message will be thrown to the user when a typo/wrong column name is used. Improved user-friendliness.
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1514456765

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8a60f864ecc2906ad70d4791badde3bec7a3e98 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172337459


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   So, in which case the fields can be missing in the schema then? Where these fields come from?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172378438


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   Then just validate the schema compatibility when constructing the pipeline, like the `HoodieTableFactory`, let's not defer the validation in flink runtime.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1523194773

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8a60f864ecc2906ad70d4791badde3bec7a3e98 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460) 
   * bb329b30e88f2b1a84418ddf451839dab7bb7948 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1178169075


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##########
@@ -401,4 +408,35 @@ private static void inferAvroSchema(Configuration conf, LogicalType rowType) {
       conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, inferredSchema);
     }
   }
+
+  /**
+   *
+   * @param conf The configuration
+   */
+  private static void validateSourceSchema(Configuration conf) {
+    final HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    final TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
+    final Schema requiredSchema = StreamerUtil.getSourceSchema(conf);
+    final Schema srcSchema;
+
+    try {
+      srcSchema = schemaResolver.getTableAvroSchema();
+    } catch (Exception e) {
+      LOG.warn("Skipping validation for requiredSchema as table avro schema could not be fetched", e);
+      return;
+    }
+

Review Comment:
   @danny0405 Yeap, agree.
   
   If you feel that this PR is not required, I'll close this then. I'll apply it into our internal version instead. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172090557


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   Hmmm, from our experience, it happens quite often when user is submitting a streaming job. 
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1177909927


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##########
@@ -401,4 +408,35 @@ private static void inferAvroSchema(Configuration conf, LogicalType rowType) {
       conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, inferredSchema);
     }
   }
+
+  /**
+   *
+   * @param conf The configuration
+   */
+  private static void validateSourceSchema(Configuration conf) {
+    final HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    final TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
+    final Schema requiredSchema = StreamerUtil.getSourceSchema(conf);
+    final Schema srcSchema;
+
+    try {
+      srcSchema = schemaResolver.getTableAvroSchema();
+    } catch (Exception e) {
+      LOG.warn("Skipping validation for requiredSchema as table avro schema could not be fetched", e);
+      return;
+    }
+

Review Comment:
   Reasonable, but restraict the entrence of the SQL queries is another way to mediate the issue, do you agree?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172043586


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   We should avoid unnecessary check on the code when it never happens.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1524529704

   Close because it is invalid.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 closed pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 closed pull request #8501: [HUDI-6103] Validate required columns when fetching required positions
URL: https://github.com/apache/hudi/pull/8501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1514468678

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8a60f864ecc2906ad70d4791badde3bec7a3e98 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1515287480

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8a60f864ecc2906ad70d4791badde3bec7a3e98 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1177685237


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##########
@@ -401,4 +408,35 @@ private static void inferAvroSchema(Configuration conf, LogicalType rowType) {
       conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, inferredSchema);
     }
   }
+
+  /**
+   *
+   * @param conf The configuration
+   */
+  private static void validateSourceSchema(Configuration conf) {
+    final HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    final TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
+    final Schema requiredSchema = StreamerUtil.getSourceSchema(conf);
+    final Schema srcSchema;
+
+    try {
+      srcSchema = schemaResolver.getTableAvroSchema();
+    } catch (Exception e) {
+      LOG.warn("Skipping validation for requiredSchema as table avro schema could not be fetched", e);
+      return;
+    }
+

Review Comment:
   Yes, there will not be any schema mismatches if table definitions are managed via catalogs. However, we have use cases where tables are not managed via a catalog.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1514443598

   @danny0405 Can you please help to review this? 
   
   Small PR to improve user friendliness. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172362684


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   Here's a concrete example:
   
   A user wants to read a hudi-source with the following schema:
   ```
   hudi-source(
       `id` INT,
       `user_id` INT,
       `name` STRING,
       `partition_col` STRING
   ) partitioned by (`partition_col`)
   ```
   
   When writing submitting a flink-sql file to be executed in the stream mode, he submitted a table DDL as such:
   
   ~~user_id~~ -> driver_id
   
   ```
   hudi-source(
       `id` INT,
       `driver_id` INT,
       `name` STRING,
       `partition_col` STRING
   ) partitioned by (`partition_col`)
   ```
   
   When something like this occurs, a VERY OBTUSE error is thrown:
   
   ```
   Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
       at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.lambda$genPartColumnarRowReader$0(ParquetSplitReaderUtil.java:119)
       at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
       at java.util.Spliterators$IntArraySpliterator.forEachRemaining(Spliterators.java:1032)
       at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
       at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
       at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
       at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
       at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   ...
   ```
   
   For something that is pretty easy to fix, it shouldn't be throwing such an OBTUSE error... So, this PR aims to make such errors more obvious.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1172381928


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadTableState.java:
##########
@@ -85,7 +86,7 @@ public int getOperationPos() {
   public int[] getRequiredPositions() {
     final List<String> fieldNames = rowType.getFieldNames();
     return requiredRowType.getFieldNames().stream()
-        .map(fieldNames::indexOf)
+        .map(f -> getIdxOfFieldOrElseThrow(f, fieldNames))
         .mapToInt(i -> i)

Review Comment:
   That make sense, let me explore this approach.
   
   I placed it in HoodieTableSource as metaclient was conveniently constructed (required for tableAvroSchema).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1177710126


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##########
@@ -401,4 +408,35 @@ private static void inferAvroSchema(Configuration conf, LogicalType rowType) {
       conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, inferredSchema);
     }
   }
+
+  /**
+   *
+   * @param conf The configuration
+   */
+  private static void validateSourceSchema(Configuration conf) {
+    final HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    final TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
+    final Schema requiredSchema = StreamerUtil.getSourceSchema(conf);
+    final Schema srcSchema;
+
+    try {
+      srcSchema = schemaResolver.getTableAvroSchema();
+    } catch (Exception e) {
+      LOG.warn("Skipping validation for requiredSchema as table avro schema could not be fetched", e);
+      return;
+    }
+

Review Comment:
   I mean, Hudi is allowing users to define avro schemas using `SOURCE_SCHEMA` too. 
   
   There are plenty of entrypoints that could lead to schema inconsistencies. If those entrypoints exists, i feel it's reasonable to at least perform some level of validation, and not throw an error message that is not helpful at all. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8501:
URL: https://github.com/apache/hudi/pull/8501#discussion_r1177669046


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##########
@@ -401,4 +408,35 @@ private static void inferAvroSchema(Configuration conf, LogicalType rowType) {
       conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, inferredSchema);
     }
   }
+
+  /**
+   *
+   * @param conf The configuration
+   */
+  private static void validateSourceSchema(Configuration conf) {
+    final HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    final TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
+    final Schema requiredSchema = StreamerUtil.getSourceSchema(conf);
+    final Schema srcSchema;
+
+    try {
+      srcSchema = schemaResolver.getTableAvroSchema();
+    } catch (Exception e) {
+      LOG.warn("Skipping validation for requiredSchema as table avro schema could not be fetched", e);
+      return;
+    }
+

Review Comment:
   I don't know why the schema mismatch for the queries, somehow you need a catalog to manage the table and schema altogether, query a table with same path but wrong schema is meaningless, and is the wrong schema hand-written by user each time? Why not just fetch the schema through the catalog?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1523878950

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16678",
       "triggerID" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bb329b30e88f2b1a84418ddf451839dab7bb7948 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16678) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8501: [HUDI-6103] Validate required columns when fetching required positions

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8501:
URL: https://github.com/apache/hudi/pull/8501#issuecomment-1523204539

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460",
       "triggerID" : "d8a60f864ecc2906ad70d4791badde3bec7a3e98",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16678",
       "triggerID" : "bb329b30e88f2b1a84418ddf451839dab7bb7948",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8a60f864ecc2906ad70d4791badde3bec7a3e98 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16460) 
   * bb329b30e88f2b1a84418ddf451839dab7bb7948 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16678) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org