Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/18 12:04:56 UTC

[GitHub] [hudi] codope opened a new pull request, #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

codope opened a new pull request, #5347:
URL: https://github.com/apache/hudi/pull/5347

   ## What is the purpose of the pull request
   
   When using custom transformers, users may not have explicitly turned on the config to reconcile schemas. In that case, the schema of the transformed dataset and the schema coming from the schema provider can differ. This PR fixes that by passing the latest schema whenever it is present, irrespective of whether `hoodie.datasource.write.reconcile.schema` is true or false.
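   
   The behavior change described above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the actual Hudi code: `LatestSchemaSketch`, `convertedBefore`, and `convertedAfter` are hypothetical names, `java.util.Optional` stands in for `org.apache.hudi.common.util.Option`, and a plain `String` stands in for an Avro `Schema`.
   
   ```scala
   import java.util.Optional

   object LatestSchemaSketch {
     // Hypothetical stand-in for an Avro Schema so the sketch has no dependencies.
     type Schema = String

     // Before the fix: the latest schema was dropped unless reconciliation was enabled.
     def convertedBefore(latest: Optional[Schema], reconcile: Boolean): Option[Schema] = {
       if (latest.isPresent && reconcile) Some(latest.get()) else None
     }

     // After the fix: the latest schema is used whenever it is present.
     // The flag stays in the signature only to show that it no longer affects the result.
     def convertedAfter(latest: Optional[Schema], reconcile: Boolean): Option[Schema] = {
       if (latest.isPresent) Some(latest.get()) else None
     }
   }
   ```
   
   With reconciliation turned off and a schema available, `convertedBefore` yields `None` while `convertedAfter` yields the schema, which is the case a custom transformer combined with a schema provider would hit.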
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan merged pull request #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #5347:
URL: https://github.com/apache/hudi/pull/5347




[GitHub] [hudi] hudi-bot commented on pull request #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5347:
URL: https://github.com/apache/hudi/pull/5347#issuecomment-1101350055

   ## CI report:
   
   * c72a7f1af64141c4b6d2ca466be5c2af9a5ca774 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5347:
URL: https://github.com/apache/hudi/pull/5347#discussion_r852166660


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -130,7 +129,15 @@ object HoodieSparkUtils extends SparkAdapterSupport {
    */
   def createRdd(df: DataFrame, structName: String, recordNamespace: String, reconcileToLatestSchema: Boolean,
                 latestTableSchema: org.apache.hudi.common.util.Option[Schema] = org.apache.hudi.common.util.Option.empty()): RDD[GenericRecord] = {
-    val latestTableSchemaConverted = if (latestTableSchema.isPresent && reconcileToLatestSchema) Some(latestTableSchema.get()) else None
+    var latestTableSchemaConverted : Option[Schema] = None
+
+    if (latestTableSchema.isPresent && reconcileToLatestSchema) {
+      latestTableSchemaConverted = Some(latestTableSchema.get())
+    } else {
+      // cases when users want to use latestTableSchema but have not turned on reconcileToLatestSchema explicitly
+      // for example, when using a Transformer implementation to transform source RDD to target RDD
+      latestTableSchemaConverted = if (latestTableSchema.isPresent) Some(latestTableSchema.get()) else None

Review Comment:
   Looks like this was a regression introduced in one of the refactorings. We had this logic in 0.10.1:
   ```
       var writeSchema : Schema = null;
       var toReconcileSchema : Schema = null;
       if (reconcileToLatestSchema && latestTableSchema.isPresent) {
         // If reconcileToLatestSchema is true and latestTableSchema is present, try to leverage latestTableSchema.
         // This code path handles situations where records are serialized in an old schema, but callers wish to convert
         // to RDD[GenericRecord] using a different schema (could be an evolved schema or the latest table schema).
         writeSchema = dfWriteSchema
         toReconcileSchema = latestTableSchema.get()
       } else {
         // there are paths where callers wish to use latestTableSchema to convert to Rdd[GenericRecords] and not use
         // row's schema. So use latestTableSchema if present. if not available, fallback to using row's schema.
         writeSchema = if (latestTableSchema.isPresent) { latestTableSchema.get()} else { dfWriteSchema}
       }
       createRddInternal(df, writeSchema, toReconcileSchema, structName, recordNamespace)
   ```
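   
   To make the two branches of the quoted snippet easier to compare, here is a minimal sketch of the same selection written as a pure function. It is an illustration under stated assumptions, not the 0.10.1 implementation: `WriteSchemaSketch`, `Selection`, and `select` are hypothetical names, `String` stands in for an Avro `Schema`, and `scala.Option` replaces the nullable variables.
   
   ```scala
   object WriteSchemaSketch {
     // Hypothetical stand-in for an Avro Schema so the sketch has no dependencies.
     type Schema = String

     // The schema to write with, plus an optional schema to reconcile against.
     final case class Selection(writeSchema: Schema, toReconcileSchema: Option[Schema])

     def select(dfWriteSchema: Schema,
                latestTableSchema: Option[Schema],
                reconcileToLatestSchema: Boolean): Selection = {
       latestTableSchema match {
         // Reconcile requested and a table schema exists: write with the row schema
         // and reconcile against the latest table schema.
         case Some(latest) if reconcileToLatestSchema => Selection(dfWriteSchema, Some(latest))
         // No reconcile requested, but a table schema exists: use it directly.
         case Some(latest) => Selection(latest, None)
         // No table schema available: fall back to the row schema.
         case None => Selection(dfWriteSchema, None)
       }
     }
   }
   ```
   
   The middle case is the one the regression dropped: a caller that supplies a table schema without enabling reconciliation still gets that schema as the write schema.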
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5347:
URL: https://github.com/apache/hudi/pull/5347#issuecomment-1101409827

   ## CI report:
   
   * c72a7f1af64141c4b6d2ca466be5c2af9a5ca774 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8103) 
   




[GitHub] [hudi] hudi-bot commented on pull request #5347: [HUDI-3707] Fix deltastreamer test with schema provider and transformer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5347:
URL: https://github.com/apache/hudi/pull/5347#issuecomment-1101351856

   ## CI report:
   
   * c72a7f1af64141c4b6d2ca466be5c2af9a5ca774 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8103) 
   

