Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/28 01:38:33 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

alexeykudinkin opened a new pull request, #5708:
URL: https://github.com/apache/hudi/pull/5708

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   This PR addresses the following issues:
    - Fixes schema delineation on partition/data schema in Spark relations
    - Properly attributes base-file reader schema
   
   Additionally:
    - Removes unnecessary projections
    - Consolidates Avro record projections within the `SafeAvroProjection` abstraction
    - Unifies base-file reader creation for MOR relations
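   The consolidation above centers on a "safe" record projection. As an illustrative sketch only (this is not the actual `SafeAvroProjection` API, and the field names are made up), the core idea is: pick out just the target schema's fields, in target order, and fail fast when a field is missing rather than silently producing bad rows:

   ```python
   # Illustrative sketch, NOT the real SafeAvroProjection API: project a record
   # onto a narrower target schema, in the target field order, failing fast on
   # any missing field.
   def safe_project(record: dict, target_fields: list) -> list:
       missing = [f for f in target_fields if f not in record]
       if missing:
           raise ValueError(f"fields missing from record: {missing}")
       return [record[f] for f in target_fields]

   record = {"id": "u1", "name": "apple", "price": "9.99"}
   print(safe_project(record, ["id", "price"]))  # ['u1', '9.99']
   ```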
   
   ## Brief change log
   
   See above 
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   This change added tests and can be verified as follows:
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928122395


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -274,7 +274,7 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
   def canPruneRelationSchema: Boolean =
     (fileFormat.isInstanceOf[ParquetFileFormat] || fileFormat.isInstanceOf[OrcFileFormat]) &&
       // NOTE: Some relations might be disabling sophisticated schema pruning techniques (for ex, nested schema pruning)
-      // TODO(HUDI-XXX) internal schema doesn't supported nested schema pruning currently
+      // TODO(HUDI-XXX) internal schema doesn't support nested schema pruning currently

Review Comment:
   Please raise a PR, and I will repair it later





[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191711419

   ## CI report:
   
   * 96de48522261c88ab79bdd04f75b41808f9d3f44 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111) 
   * db793c0aa2e3db4a114d4e86a0249b3bf36188d2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1190845038

   ## CI report:
   
   * 372c9b452bf9894c544b77a2798c5581bceab48e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076) 
   * 96de48522261c88ab79bdd04f75b41808f9d3f44 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan merged pull request #5708: [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #5708:
URL: https://github.com/apache/hudi/pull/5708




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1189742888

   ## CI report:
   
   * 372c9b452bf9894c544b77a2798c5581bceab48e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1190847642

   ## CI report:
   
   * 372c9b452bf9894c544b77a2798c5581bceab48e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076) 
   * 96de48522261c88ab79bdd04f75b41808f9d3f44 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1189681658

   ## CI report:
   
   * b21276b6bcab5b88ee7c01d428a107da4a2fc5ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055) 
   * 372c9b452bf9894c544b77a2798c5581bceab48e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r927098140


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -564,42 +538,57 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     //       we have to eagerly initialize all of the readers even though only one specific to the type
     //       of the file being read will be used. This is required to avoid serialization of the whole
     //       relation (containing file-index for ex) and passing it to the executor
-    val reader = tableBaseFileFormat match {
-      case HoodieFileFormat.PARQUET =>
-        HoodieDataSourceHelper.buildHoodieParquetReader(
-          sparkSession = spark,
-          dataSchema = dataSchema.structTypeSchema,
-          partitionSchema = partitionSchema,
-          requiredSchema = requiredSchema.structTypeSchema,
-          filters = filters,
-          options = options,
-          hadoopConf = hadoopConf,
-          // We're delegating to Spark to append partition values to every row only in cases
-          // when these corresponding partition-values are not persisted w/in the data file itself
-          appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
-        )
+    val (read: (PartitionedFile => Iterator[InternalRow]), schema: StructType) =
+      tableBaseFileFormat match {
+        case HoodieFileFormat.PARQUET =>
+          (
+            HoodieDataSourceHelper.buildHoodieParquetReader(
+              sparkSession = spark,
+              dataSchema = dataSchema.structTypeSchema,
+              partitionSchema = partitionSchema,
+              requiredSchema = requiredSchema.structTypeSchema,
+              filters = filters,
+              options = options,
+              hadoopConf = hadoopConf,
+              // We're delegating to Spark to append partition values to every row only in cases
+              // when these corresponding partition-values are not persisted w/in the data file itself
+              appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
+            ),
+            // Since partition values by default are omitted, and not persisted w/in data-files by Spark,
+            // data-file readers (such as [[ParquetFileFormat]]) have to inject partition values while reading
+            // the data. As such, actual full schema produced by such reader is composed of
+            //    a) Prepended partition column values
+            //    b) Data-file schema (projected or not)
+            StructType(partitionSchema.fields ++ requiredSchema.structTypeSchema.fields)

Review Comment:
   Why prepend and not append? Curious to know the considerations.
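   To make the ordering question concrete, here is a simplified model (field names are illustrative, and this deliberately ignores Spark's actual `StructType` machinery): whichever side the file reader emits partition values on, the composed schema must list them on the same side, or every downstream projection resolves columns at the wrong ordinals.

   ```python
   # Simplified model of composing the reader's output schema. If the reader
   # appends partition values AFTER the data columns, the schema describing its
   # output must list them last too; prepending would shift every ordinal.
   required_data_fields = ["_hoodie_commit_time", "id", "price"]
   partition_fields = ["dt"]

   reader_schema = required_data_fields + partition_fields
   print(reader_schema)  # ['_hoodie_commit_time', 'id', 'price', 'dt']
   ```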



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala:
##########
@@ -129,10 +156,13 @@ class HoodieMergeOnReadRDD(@transient sc: SparkContext,
     //          a) It does use one of the standard (and whitelisted) Record Payload classes
     //       then we can avoid reading and parsing the records w/ _full_ schema, and instead only
     //       rely on projected one, nevertheless being able to perform merging correctly
-    if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
-      (fileReaders.fullSchemaFileReader(split.dataFile.get), dataSchema)
-    else
-      (fileReaders.requiredSchemaFileReaderForMerging(split.dataFile.get), requiredSchema)
+    val reader = if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName)) {

Review Comment:
   nit: I'd prefer the `if` without negation
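   A hedged sketch of the selection logic under review, written without the negated condition (class name is a real Hudi payload class, but the whitelist contents here are illustrative): read with the projected schema only for whitelisted payload classes, and fall back to the full schema otherwise, since an unknown payload's merge logic may touch any column.

   ```python
   # Sketch of the reader-selection branch, with the positive condition first.
   WHITELISTED_PAYLOAD_CLASSES = {
       "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",  # illustrative
   }

   def pick_reader(payload_class: str, full_reader, projected_reader):
       if payload_class in WHITELISTED_PAYLOAD_CLASSES:
           return projected_reader  # merging is safe on the projected schema
       return full_reader           # unknown payload: merge may need every column

   print(pick_reader("com.example.CustomPayload", "full", "projected"))  # full
   ```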





[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193038313

   ## CI report:
   
   * e121b002945d5aad30b00893dbb00a706be2ebba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10225) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140181562

   ## CI report:
   
   * 4d99164bb73769fedbb62c7a3688146f6cbd92c7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1192181156

   ## CI report:
   
   * 192e15ade1d6b8a291d003477b287bd7a5ef9e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] YannByron commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928123957


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -564,42 +538,56 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     //       we have to eagerly initialize all of the readers even though only one specific to the type
     //       of the file being read will be used. This is required to avoid serialization of the whole
     //       relation (containing file-index for ex) and passing it to the executor
-    val reader = tableBaseFileFormat match {
-      case HoodieFileFormat.PARQUET =>
-        HoodieDataSourceHelper.buildHoodieParquetReader(
-          sparkSession = spark,
-          dataSchema = dataSchema.structTypeSchema,
-          partitionSchema = partitionSchema,
-          requiredSchema = requiredSchema.structTypeSchema,
-          filters = filters,
-          options = options,
-          hadoopConf = hadoopConf,
-          // We're delegating to Spark to append partition values to every row only in cases
-          // when these corresponding partition-values are not persisted w/in the data file itself
-          appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
-        )
+    val (read: (PartitionedFile => Iterator[InternalRow]), schema: StructType) =
+      tableBaseFileFormat match {
+        case HoodieFileFormat.PARQUET =>
+          val parquetReader = HoodieDataSourceHelper.buildHoodieParquetReader(
+            sparkSession = spark,
+            dataSchema = dataSchema.structTypeSchema,
+            partitionSchema = partitionSchema,
+            requiredSchema = requiredDataSchema.structTypeSchema,
+            filters = filters,
+            options = options,
+            hadoopConf = hadoopConf,
+            // We're delegating to Spark to append partition values to every row only in cases
+            // when these corresponding partition-values are not persisted w/in the data file itself
+            appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
+          )
+          // Since partition values by default are omitted, and not persisted w/in data-files by Spark,
+          // data-file readers (such as [[ParquetFileFormat]]) have to inject partition values while reading
+          // the data. As such, actual full schema produced by such reader is composed of
+          //    a) Data-file schema (projected or not)
+          //    b) Appended partition column values
+          val readerSchema = StructType(requiredDataSchema.structTypeSchema.fields ++ partitionSchema.fields)
+
+          (parquetReader, readerSchema)
 
       case HoodieFileFormat.HFILE =>
-        createHFileReader(
+        val hfileReader = createHFileReader(
           spark = spark,
           dataSchema = dataSchema,
-          requiredSchema = requiredSchema,
+          requiredDataSchema = requiredDataSchema,
           filters = filters,
           options = options,
           hadoopConf = hadoopConf
         )
 
+        (hfileReader, requiredDataSchema.structTypeSchema)
+
       case _ => throw new UnsupportedOperationException(s"Base file format is not currently supported ($tableBaseFileFormat)")
     }
 
-    partitionedFile => {
-      val extension = FSUtils.getFileExtension(partitionedFile.filePath)
-      if (tableBaseFileFormat.getFileExtension.equals(extension)) {
-        reader.apply(partitionedFile)
-      } else {
-        throw new UnsupportedOperationException(s"Invalid base-file format ($extension), expected ($tableBaseFileFormat)")

Review Comment:
   can we move that check into `HoodieDataSourceHelper.buildHoodieParquetReader` and `createHFileReader` separately?
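
   For reference, the extension check under discussion could be folded into each builder by wrapping the returned read function, roughly along these lines (a sketch only -- `wrapWithExtensionCheck` is a hypothetical helper, and `PartitionedFile.filePath` is assumed to be a plain `String` as in the Spark versions targeted here):

   ```scala
   import org.apache.hudi.common.fs.FSUtils
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.execution.datasources.PartitionedFile

   // Hypothetical helper: validates the file extension up-front, so each
   // builder (Parquet, HFile) can return an already-guarded read function
   def wrapWithExtensionCheck(expectedExtension: String,
                              read: PartitionedFile => Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] =
     partitionedFile => {
       val extension = FSUtils.getFileExtension(partitionedFile.filePath)
       if (expectedExtension == extension) {
         read(partitionedFile)
       } else {
         throw new UnsupportedOperationException(
           s"Invalid base-file format ($extension), expected ($expectedExtension)")
       }
     }
   ```

   Each builder would then wrap its own reader before returning it, and the caller's match expression would no longer need a trailing format check.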





[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1188868431

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b21276b6bcab5b88ee7c01d428a107da4a2fc5ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1192151709

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111",
       "triggerID" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "triggerType" : "PUSH"
     }, {
       "hash" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152",
       "triggerID" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161",
       "triggerID" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e689b295bf78d07ec16ecad0da2956672987862a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161) 
   * 192e15ade1d6b8a291d003477b287bd7a5ef9e76 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193159473

   @xiarixiaoyao unfortunately I don't think we will be able to do that in 0.12 -- there are still quite a few optimizations predicated on `HadoopFsRelation` that we don't want to miss.
   
   Migrating completely to DSv2 will help address these problems and allow us to avoid the fallback.




[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140136956

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f228def49fd0a72d39d857e08046151fe1b512b5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r927259929


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -564,42 +538,57 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     //       we have to eagerly initialize all of the readers even though only one specific to the type
     //       of the file being read will be used. This is required to avoid serialization of the whole
     //       relation (containing file-index for ex) and passing it to the executor
-    val reader = tableBaseFileFormat match {
-      case HoodieFileFormat.PARQUET =>
-        HoodieDataSourceHelper.buildHoodieParquetReader(
-          sparkSession = spark,
-          dataSchema = dataSchema.structTypeSchema,
-          partitionSchema = partitionSchema,
-          requiredSchema = requiredSchema.structTypeSchema,
-          filters = filters,
-          options = options,
-          hadoopConf = hadoopConf,
-          // We're delegating to Spark to append partition values to every row only in cases
-          // when these corresponding partition-values are not persisted w/in the data file itself
-          appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
-        )
+    val (read: (PartitionedFile => Iterator[InternalRow]), schema: StructType) =
+      tableBaseFileFormat match {
+        case HoodieFileFormat.PARQUET =>
+          (
+            HoodieDataSourceHelper.buildHoodieParquetReader(
+              sparkSession = spark,
+              dataSchema = dataSchema.structTypeSchema,
+              partitionSchema = partitionSchema,
+              requiredSchema = requiredSchema.structTypeSchema,
+              filters = filters,
+              options = options,
+              hadoopConf = hadoopConf,
+              // We're delegating to Spark to append partition values to every row only in cases
+              // when these corresponding partition-values are not persisted w/in the data file itself
+              appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
+            ),
+            // Since partition values by default are omitted, and not persisted w/in data-files by Spark,
+            // data-file readers (such as [[ParquetFileFormat]]) have to inject partition values while reading
+            // the data. As such, actual full schema produced by such reader is composed of
+            //    a) Prepended partition column values
+            //    b) Data-file schema (projected or not)
+            StructType(partitionSchema.fields ++ requiredSchema.structTypeSchema.fields)

Review Comment:
   It was actually a typo -- partition columns are actually appended (Spark's ParquetFileFormat has it incorrectly [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L237))
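
   A minimal, self-contained illustration of the ordering being discussed (field names here are made up; only spark-sql is needed on the classpath):

   ```scala
   import org.apache.spark.sql.types._

   // Data columns actually read from the base file (projected or not)
   val requiredDataSchema = StructType(Seq(
     StructField("id", LongType),
     StructField("name", StringType)))

   // Partition columns injected by the reader from the partition path
   val partitionSchema = StructType(Seq(StructField("dt", StringType)))

   // Reader output schema: data fields first, partition fields appended last
   val readerSchema = StructType(requiredDataSchema.fields ++ partitionSchema.fields)
   // readerSchema.fieldNames => Array("id", "name", "dt")
   ```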






[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193012753

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111",
       "triggerID" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "triggerType" : "PUSH"
     }, {
       "hash" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152",
       "triggerID" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161",
       "triggerID" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183",
       "triggerID" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e121b002945d5aad30b00893dbb00a706be2ebba",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e121b002945d5aad30b00893dbb00a706be2ebba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 192e15ade1d6b8a291d003477b287bd7a5ef9e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183) 
   * e121b002945d5aad30b00893dbb00a706be2ebba UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1190990676

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111",
       "triggerID" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 96de48522261c88ab79bdd04f75b41808f9d3f44 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140174600

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f228def49fd0a72d39d857e08046151fe1b512b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981) 
   * 4d99164bb73769fedbb62c7a3688146f6cbd92c7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140141760

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f228def49fd0a72d39d857e08046151fe1b512b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191706979

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111",
       "triggerID" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "triggerType" : "PUSH"
     }, {
       "hash" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 96de48522261c88ab79bdd04f75b41808f9d3f44 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111) 
   * db793c0aa2e3db4a114d4e86a0249b3bf36188d2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1188619927

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4d99164bb73769fedbb62c7a3688146f6cbd92c7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986) 
   * b21276b6bcab5b88ee7c01d428a107da4a2fc5ad UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928122745


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -645,17 +642,45 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
 
 object HoodieBaseRelation extends SparkAdapterSupport {
 
-  type BaseFileReader = PartitionedFile => Iterator[InternalRow]
+  case class BaseFileReader(read: PartitionedFile => Iterator[InternalRow], val schema: StructType) {
+    def apply(file: PartitionedFile): Iterator[InternalRow] = read.apply(file)
+  }
 
-  private def generateUnsafeProjection(from: StructType, to: StructType) =
-    sparkAdapter.getCatalystExpressionUtils().generateUnsafeProjection(from, to)
+  def generateUnsafeProjection(from: StructType, to: StructType): UnsafeProjection =
+    sparkAdapter.getCatalystExpressionUtils.generateUnsafeProjection(from, to)
 
   def convertToAvroSchema(structSchema: StructType): Schema =
     sparkAdapter.getAvroSchemaConverters.toAvroType(structSchema, nullable = false, "Record")
 
   def getPartitionPath(fileStatus: FileStatus): Path =
     fileStatus.getPath.getParent
 
+  /**
+   * Projects provided file reader's output from its original schema, into a [[requiredSchema]]
+   *
+   * NOTE: [[requiredSchema]] has to be a proper subset of the file reader's schema
+   *
+   * @param reader file reader to be projected
+   * @param requiredSchema target schema for the output of the provided file reader
+   */
+  def projectReader(reader: BaseFileReader, requiredSchema: StructType): BaseFileReader = {
+    checkState(reader.schema.fields.toSet.intersect(requiredSchema.fields.toSet).size == requiredSchema.size)
+
+    if (reader.schema == requiredSchema) {
+      reader
+    } else {
+      val read = reader.apply(_)
+      val projectedRead: PartitionedFile => Iterator[InternalRow] = (file: PartitionedFile) => {
+        // NOTE: Projection is not a serializable object, hence its creation should only happen w/in
+        //       the executor process
+        val unsafeProjection = generateUnsafeProjection(reader.schema, requiredSchema)
+        read(file).map(unsafeProjection)
+      }
+
+      BaseFileReader(projectedRead, requiredSchema)

Review Comment:
   maybe we don't need to pass requiredSchema





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928145861


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -274,7 +274,7 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
   def canPruneRelationSchema: Boolean =
     (fileFormat.isInstanceOf[ParquetFileFormat] || fileFormat.isInstanceOf[OrcFileFormat]) &&
       // NOTE: Some relations might be disabling sophisticated schema pruning techniques (for ex, nested schema pruning)
-      // TODO(HUDI-XXX) internal schema doesn't supported nested schema pruning currently
+      // TODO(HUDI-XXX) internal schema doesn't support nested schema pruning currently

Review Comment:
   You mean a JIRA, right? Will do



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -645,17 +642,45 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
 
 object HoodieBaseRelation extends SparkAdapterSupport {
 
-  type BaseFileReader = PartitionedFile => Iterator[InternalRow]
+  case class BaseFileReader(read: PartitionedFile => Iterator[InternalRow], val schema: StructType) {
+    def apply(file: PartitionedFile): Iterator[InternalRow] = read.apply(file)
+  }
 
-  private def generateUnsafeProjection(from: StructType, to: StructType) =
-    sparkAdapter.getCatalystExpressionUtils().generateUnsafeProjection(from, to)
+  def generateUnsafeProjection(from: StructType, to: StructType): UnsafeProjection =
+    sparkAdapter.getCatalystExpressionUtils.generateUnsafeProjection(from, to)
 
   def convertToAvroSchema(structSchema: StructType): Schema =
     sparkAdapter.getAvroSchemaConverters.toAvroType(structSchema, nullable = false, "Record")
 
   def getPartitionPath(fileStatus: FileStatus): Path =
     fileStatus.getPath.getParent
 
+  /**
+   * Projects provided file reader's output from its original schema, into a [[requiredSchema]]
+   *
+   * NOTE: [[requiredSchema]] has to be a proper subset of the file reader's schema
+   *
+   * @param reader file reader to be projected
+   * @param requiredSchema target schema for the output of the provided file reader
+   */
+  def projectReader(reader: BaseFileReader, requiredSchema: StructType): BaseFileReader = {
+    checkState(reader.schema.fields.toSet.intersect(requiredSchema.fields.toSet).size == requiredSchema.size)
+
+    if (reader.schema == requiredSchema) {
+      reader
+    } else {
+      val read = reader.apply(_)
+      val projectedRead: PartitionedFile => Iterator[InternalRow] = (file: PartitionedFile) => {
+        // NOTE: Projection is not a serializable object, hence its creation should only happen w/in
+        //       the executor process
+        val unsafeProjection = generateUnsafeProjection(reader.schema, requiredSchema)
+        read(file).map(unsafeProjection)
+      }
+
+      BaseFileReader(projectedRead, requiredSchema)

Review Comment:
   Please check my comment where this method is used for an example: whenever we prune partition columns, the ordering of the columns changes (partition columns are removed and then appended to the resulting schema). Without projecting back into the required schema, the caller would therefore get a dataset with incorrect column ordering.
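   The reordering concern described above can be illustrated with a small standalone sketch (hypothetical row/schema types for illustration only, not the actual Spark or Hudi classes):

   ```scala
   object ProjectionSketch {
     // A "schema" here is just an ordered list of column names.
     type Schema = Seq[String]
     // A "row" is the reader's output: (columnName, value) pairs in reader order.
     type Row = Seq[(String, Any)]

     // After partition pruning, readers emit data columns first and then append
     // the partition columns, so the reader's output order may not match the
     // order the caller requested. Projecting restores the requested order.
     def project(row: Row, requiredSchema: Schema): Row = {
       val byName = row.toMap
       requiredSchema.map(name => name -> byName(name))
     }

     def main(args: Array[String]): Unit = {
       // Caller requested: (partitionCol, a, b)
       val required: Schema = Seq("partitionCol", "a", "b")
       // Reader produced: data columns first, partition column appended last.
       val readerRow: Row = Seq("a" -> 1, "b" -> 2, "partitionCol" -> "p1")

       val projected = project(readerRow, required)
       println(projected.map(_._1).mkString(","))
     }
   }
   ```

   This mirrors why `projectReader` keeps the `requiredSchema` argument: the projection is what re-establishes the caller's expected column ordering.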





[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191895568

   ## CI report:
   
   * db793c0aa2e3db4a114d4e86a0249b3bf36188d2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152) 
   * e689b295bf78d07ec16ecad0da2956672987862a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140136384

   ## CI report:
   
   * f228def49fd0a72d39d857e08046151fe1b512b5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191985695

   ## CI report:
   
   * e689b295bf78d07ec16ecad0da2956672987862a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191823147

   ## CI report:
   
   * db793c0aa2e3db4a114d4e86a0249b3bf36188d2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193014281

   ## CI report:
   
   * 192e15ade1d6b8a291d003477b287bd7a5ef9e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183) 
   * e121b002945d5aad30b00893dbb00a706be2ebba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10225) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] YannByron commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928123957


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -564,42 +538,56 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     //       we have to eagerly initialize all of the readers even though only one specific to the type
     //       of the file being read will be used. This is required to avoid serialization of the whole
     //       relation (containing file-index for ex) and passing it to the executor
-    val reader = tableBaseFileFormat match {
-      case HoodieFileFormat.PARQUET =>
-        HoodieDataSourceHelper.buildHoodieParquetReader(
-          sparkSession = spark,
-          dataSchema = dataSchema.structTypeSchema,
-          partitionSchema = partitionSchema,
-          requiredSchema = requiredSchema.structTypeSchema,
-          filters = filters,
-          options = options,
-          hadoopConf = hadoopConf,
-          // We're delegating to Spark to append partition values to every row only in cases
-          // when these corresponding partition-values are not persisted w/in the data file itself
-          appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
-        )
+    val (read: (PartitionedFile => Iterator[InternalRow]), schema: StructType) =
+      tableBaseFileFormat match {
+        case HoodieFileFormat.PARQUET =>
+          val parquetReader = HoodieDataSourceHelper.buildHoodieParquetReader(
+            sparkSession = spark,
+            dataSchema = dataSchema.structTypeSchema,
+            partitionSchema = partitionSchema,
+            requiredSchema = requiredDataSchema.structTypeSchema,
+            filters = filters,
+            options = options,
+            hadoopConf = hadoopConf,
+            // We're delegating to Spark to append partition values to every row only in cases
+            // when these corresponding partition-values are not persisted w/in the data file itself
+            appendPartitionValues = shouldExtractPartitionValuesFromPartitionPath
+          )
+          // Since partition values by default are omitted, and not persisted w/in data-files by Spark,
+          // data-file readers (such as [[ParquetFileFormat]]) have to inject partition values while reading
+          // the data. As such, actual full schema produced by such reader is composed of
+          //    a) Data-file schema (projected or not)
+          //    b) Appended partition column values
+          val readerSchema = StructType(requiredDataSchema.structTypeSchema.fields ++ partitionSchema.fields)
+
+          (parquetReader, readerSchema)
 
       case HoodieFileFormat.HFILE =>
-        createHFileReader(
+        val hfileReader = createHFileReader(
           spark = spark,
           dataSchema = dataSchema,
-          requiredSchema = requiredSchema,
+          requiredDataSchema = requiredDataSchema,
           filters = filters,
           options = options,
           hadoopConf = hadoopConf
         )
 
+        (hfileReader, requiredDataSchema.structTypeSchema)
+
       case _ => throw new UnsupportedOperationException(s"Base file format is not currently supported ($tableBaseFileFormat)")
     }
 
-    partitionedFile => {
-      val extension = FSUtils.getFileExtension(partitionedFile.filePath)
-      if (tableBaseFileFormat.getFileExtension.equals(extension)) {
-        reader.apply(partitionedFile)
-      } else {
-        throw new UnsupportedOperationException(s"Invalid base-file format ($extension), expected ($tableBaseFileFormat)")

Review Comment:
   can we move this check into `HoodieDataSourceHelper.buildHoodieParquetReader` and `createHFileReader` separately?
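   The per-file format guard being discussed can be sketched as follows (hypothetical helper names for illustration only, not the actual Hudi utilities):

   ```scala
   object ExtensionCheckSketch {
     // Extracts the file extension, including the leading dot.
     def fileExtension(path: String): String = {
       val dot = path.lastIndexOf('.')
       if (dot < 0) "" else path.substring(dot)
     }

     // Wraps a reader so it fails fast on files of an unexpected format,
     // mirroring the guard around the per-file dispatch in the diff above.
     def guarded[A](expectedExtension: String, read: String => A): String => A =
       path => {
         val ext = fileExtension(path)
         require(ext == expectedExtension,
           s"Invalid base-file format ($ext), expected ($expectedExtension)")
         read(path)
       }

     def main(args: Array[String]): Unit = {
       val readParquet = guarded(".parquet", (p: String) => s"read $p")
       println(readParquet("part-0001.parquet"))
     }
   }
   ```

   Whether the guard lives at the dispatch site (as in the diff) or inside each reader builder is the design question raised here; the dispatch-site placement keeps a single validation point for all formats.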





[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1188622819

   ## CI report:
   
   * 4d99164bb73769fedbb62c7a3688146f6cbd92c7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986) 
   * b21276b6bcab5b88ee7c01d428a107da4a2fc5ad Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1191935604

   ## CI report:
   
   * db793c0aa2e3db4a114d4e86a0249b3bf36188d2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152) 
   * e689b295bf78d07ec16ecad0da2956672987862a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1192153606

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10076",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10111",
       "triggerID" : "96de48522261c88ab79bdd04f75b41808f9d3f44",
       "triggerType" : "PUSH"
     }, {
       "hash" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10152",
       "triggerID" : "db793c0aa2e3db4a114d4e86a0249b3bf36188d2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161",
       "triggerID" : "e689b295bf78d07ec16ecad0da2956672987862a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183",
       "triggerID" : "192e15ade1d6b8a291d003477b287bd7a5ef9e76",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e689b295bf78d07ec16ecad0da2956672987862a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10161) 
   * 192e15ade1d6b8a291d003477b287bd7a5ef9e76 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10183) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1189679501

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055",
       "triggerID" : "b21276b6bcab5b88ee7c01d428a107da4a2fc5ad",
       "triggerType" : "PUSH"
     }, {
       "hash" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "372c9b452bf9894c544b77a2798c5581bceab48e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b21276b6bcab5b88ee7c01d428a107da4a2fc5ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10055) 
   * 372c9b452bf9894c544b77a2798c5581bceab48e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xiarixiaoyao commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193125585

   @alexeykudinkin 
   A small question: for a COW table, we only use `BaseFileOnlyRelation` when schemaOnRead=true; otherwise we fall back to `HadoopFsRelation`. With this PR, is it possible for us to remove the fallback logic?
   
   
   
   




[GitHub] [hudi] alexeykudinkin commented on pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193158395

   CI is finally green:
   https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10225&view=results




[GitHub] [hudi] xushiyan commented on pull request #5708: [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1193194069

   ![Screen Shot 2022-07-23 at 4 54 16 PM](https://user-images.githubusercontent.com/2701446/180624069-982e8fc4-6c8d-4fe5-9e55-d3cdef0aa74d.png)
   CI passed. Landing now.




[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5708: [HUDI-4420][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #5708:
URL: https://github.com/apache/hudi/pull/5708#discussion_r928122799


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -645,17 +642,45 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
 
 object HoodieBaseRelation extends SparkAdapterSupport {
 
-  type BaseFileReader = PartitionedFile => Iterator[InternalRow]
+  case class BaseFileReader(read: PartitionedFile => Iterator[InternalRow], val schema: StructType) {
+    def apply(file: PartitionedFile): Iterator[InternalRow] = read.apply(file)
+  }
 
-  private def generateUnsafeProjection(from: StructType, to: StructType) =
-    sparkAdapter.getCatalystExpressionUtils().generateUnsafeProjection(from, to)
+  def generateUnsafeProjection(from: StructType, to: StructType): UnsafeProjection =
+    sparkAdapter.getCatalystExpressionUtils.generateUnsafeProjection(from, to)
 
   def convertToAvroSchema(structSchema: StructType): Schema =
     sparkAdapter.getAvroSchemaConverters.toAvroType(structSchema, nullable = false, "Record")
 
   def getPartitionPath(fileStatus: FileStatus): Path =
     fileStatus.getPath.getParent
 
+  /**
+   * Projects provided file reader's output from its original schema, into a [[requiredSchema]]
+   *
+   * NOTE: [[requiredSchema]] has to be a proper subset of the file reader's schema
+   *
+   * @param reader file reader to be projected
+   * @param requiredSchema target schema for the output of the provided file reader
+   */
+  def projectReader(reader: BaseFileReader, requiredSchema: StructType): BaseFileReader = {
+    checkState(reader.schema.fields.toSet.intersect(requiredSchema.fields.toSet).size == requiredSchema.size)
+
+    if (reader.schema == requiredSchema) {
+      reader
+    } else {
+      val read = reader.apply(_)
+      val projectedRead: PartitionedFile => Iterator[InternalRow] = (file: PartitionedFile) => {
+        // NOTE: Projection is not a serializable object, hence its creation should only happen w/in
+        //       the executor process
+        val unsafeProjection = generateUnsafeProjection(reader.schema, requiredSchema)
+        read(file).map(unsafeProjection)
+      }
+
+      BaseFileReader(projectedRead, requiredSchema)

Review Comment:
   Why do we still need `requiredSchema`?
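
   For context, the projection performed by `projectReader` above can be illustrated with a minimal, framework-free sketch. This is not Hudi or Spark API — the names and row representation are purely illustrative — but it mirrors the contract: the required schema must be a subset of the reader's schema, a reader whose schema already matches is returned as-is, and the projection is constructed lazily per invocation (as the `UnsafeProjection` is, inside the executor process):

   ```python
   def project_reader(read, schema, required_schema):
       """Wrap a reader so its rows are projected onto required_schema.

       read            -- callable: file -> iterator of rows (lists of values)
       schema          -- list of field names describing the reader's output
       required_schema -- subset of schema to project onto
       """
       # Mirrors the checkState precondition: required fields must all exist
       assert set(required_schema) <= set(schema)
       if schema == required_schema:
           return read, schema
       # Column indices to pick, in the order the target schema demands
       idx = [schema.index(f) for f in required_schema]

       def projected(file):
           # The projection is set up on each call, mirroring how the
           # (non-serializable) UnsafeProjection is instantiated per task
           return ([row[i] for i in idx] for row in read(file))

       return projected, required_schema
   ```

   For example, projecting rows read under `["id", "name", "flag"]` onto `["name", "id"]` both narrows and reorders each row.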





[GitHub] [hudi] hudi-bot commented on pull request #5708: [WIP][Stacked on 5430] Fixing table schema delineation on partition/data schema for Spark relations

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5708:
URL: https://github.com/apache/hudi/pull/5708#issuecomment-1140175097

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981",
       "triggerID" : "f228def49fd0a72d39d857e08046151fe1b512b5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986",
       "triggerID" : "4d99164bb73769fedbb62c7a3688146f6cbd92c7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f228def49fd0a72d39d857e08046151fe1b512b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8981) 
   * 4d99164bb73769fedbb62c7a3688146f6cbd92c7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8986) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

