Posted to commits@hudi.apache.org by "jonvex (via GitHub)" <gi...@apache.org> on 2023/02/15 22:00:21 UTC

[GitHub] [hudi] jonvex opened a new pull request, #7974: add validation to HoodieAvroUtils

jonvex opened a new pull request, #7974:
URL: https://github.com/apache/hudi/pull/7974

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low, medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of a config is changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1448741857

   ## CI report:
   
   * d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255) 
   * fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15490) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1120539498


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -103,7 +103,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport {
         val transform: GenericRecord => GenericRecord =
           if (sameSchema) identity
           else {
-            HoodieAvroUtils.rewriteRecordDeep(_, readerAvroSchema)
+            HoodieAvroUtils.rewriteRecordDeep(_, readerAvroSchema, true)

Review Comment:
   This is in the fast path, so we should be cautious about doing any per-record validation in general. We already have schema validations, so I'm wondering whether this is really necessary. Why not rely on the schema validations instead of validating every record?
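   To make the trade-off concrete, here is an illustrative sketch (the helper name and structure are assumptions, not Hudi's actual code). Avro's `GenericData.validate` recursively walks every field of a record, so calling it per record on the write path adds work proportional to the schema size for every record, whereas a reader/writer schema compatibility check is a one-time, per-batch cost:

   ```scala
   import org.apache.avro.Schema
   import org.apache.avro.generic.{GenericData, GenericRecord}

   // Illustrative only: per-record validation on the fast path.
   // GenericData.validate checks each field of the record against the schema.
   def validatePerRecord(records: Iterator[GenericRecord],
                         readerSchema: Schema): Iterator[GenericRecord] =
     records.map { record =>
       if (!GenericData.get().validate(readerSchema, record)) {
         throw new IllegalArgumentException(
           s"Record does not conform to schema ${readerSchema.getFullName}")
       }
       record
     }
   ```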





[GitHub] [hudi] jonvex commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1108941277


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkUtils.scala:
##########
@@ -132,81 +136,80 @@ class TestHoodieSparkUtils {
       .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate
-
-    val innerStruct1 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-    val structType1 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct1,true)
-    val schema1 = AvroConversionUtils.convertStructTypeToAvroSchema(structType1, "test_struct_name", "test_namespace")
-    val records1 = Seq(Row("key1", Row("innerKey1_1", 1L), Row("innerKey1_2", 2L)))
-
-    val df1 = spark.createDataFrame(spark.sparkContext.parallelize(records1), structType1)
-    val genRecRDD1 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema1))
-    assert(schema1.equals(genRecRDD1.collect()(0).getSchema))
-
-    // create schema2 which has one addition column at the root level compared to schema1
-    val structType2 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct1,true)
-      .add("nullableInnerStruct2",innerStruct1,true)
-    val schema2 = AvroConversionUtils.convertStructTypeToAvroSchema(structType2, "test_struct_name", "test_namespace")
-    val records2 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L), Row("innerKey2_3", 2L)))
-    val df2 = spark.createDataFrame(spark.sparkContext.parallelize(records2), structType2)
-    val genRecRDD2 = HoodieSparkUtils.createRdd(df2, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema2))
-    assert(schema2.equals(genRecRDD2.collect()(0).getSchema))
-
-    // send records1 with schema2. should succeed since the new column is nullable.
-    val genRecRDD3 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema2))
-    assert(genRecRDD3.collect()(0).getSchema.equals(schema2))
-    genRecRDD3.foreach(entry => assertNull(entry.get("nullableInnerStruct2")))
-
-    val innerStruct3 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-      .add("new_nested_col","string",true)
-
-    // create a schema which has one additional nested column compared to schema1, which is nullable
-    val structType4 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct3,true)
-
-    val schema4 = AvroConversionUtils.convertStructTypeToAvroSchema(structType4, "test_struct_name", "test_namespace")
-    val records4 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L, "new_nested_col_val1")))
-    val df4 = spark.createDataFrame(spark.sparkContext.parallelize(records4), structType4)
-    val genRecRDD4 = HoodieSparkUtils.createRdd(df4, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema4))
-    assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
-
-    // convert batch 1 with schema4. should succeed.
-    val genRecRDD5 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema4))
-    assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
-    val genRec = genRecRDD5.collect()(0)
-    val nestedRec : GenericRecord = genRec.get("nullableInnerStruct").asInstanceOf[GenericRecord]
-    assertNull(nestedRec.get("new_nested_col"))
-    assertNotNull(nestedRec.get("innerKey"))
-    assertNotNull(nestedRec.get("innerValue"))
-
-    val innerStruct4 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-      .add("new_nested_col","string",false)
-    // create a schema which has one additional nested column compared to schema1, which is non nullable
-    val structType6 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct4,true)
-
-    val schema6 = AvroConversionUtils.convertStructTypeToAvroSchema(structType6, "test_struct_name", "test_namespace")
-    // convert batch 1 with schema5. should fail since the missed out column is not nullable.
     try {
-      val genRecRDD6 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-        org.apache.hudi.common.util.Option.of(schema6))
-      genRecRDD6.collect()
-      fail("createRdd should fail, because records don't have a column which is not nullable in the passed in schema")
-    } catch {
-      case e: Exception =>

Review Comment:
   I actually changed the contents of this exception case block.





Re: [PR] add validation to HoodieAvroUtils [hudi]

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex closed pull request #7974: add validation to HoodieAvroUtils
URL: https://github.com/apache/hudi/pull/7974




[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1433628077

   ## CI report:
   
   * d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1433620373

   ## CI report:
   
   * d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] jonvex commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1108939360


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkUtils.scala:
##########
@@ -95,32 +95,36 @@ class TestHoodieSparkUtils {
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate
 
-    val schema = DataSourceTestUtils.getStructTypeExampleSchema

Review Comment:
   Copied and pasted everything into a try/finally because the test doesn't stop Spark on failure, which was causing all the other tests to fail by trying to run two Spark sessions at the same time.
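   The fix described above can be sketched roughly like this (a minimal sketch, not the exact test code):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Stop the SparkSession in a finally block so a failed assertion
   // cannot leak a running session into the tests that follow.
   val spark = SparkSession.builder()
     .appName("TestHoodieSparkUtils")
     .master("local[1]")
     .getOrCreate()
   try {
     // ... test body: build DataFrames, call HoodieSparkUtils.createRdd, assert ...
   } finally {
     spark.stop() // always runs, even when an assertion above throws
   }
   ```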





[GitHub] [hudi] jonvex commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1108940571


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkUtils.scala:
##########
@@ -132,81 +136,80 @@ class TestHoodieSparkUtils {
       .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate

Review Comment:
   I put everything in a try/finally, but there is one change here, marked below.





[GitHub] [hudi] jonvex commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1108941904


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkUtils.scala:
##########
@@ -132,81 +136,80 @@ class TestHoodieSparkUtils {
       .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate
-
-    val innerStruct1 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-    val structType1 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct1,true)
-    val schema1 = AvroConversionUtils.convertStructTypeToAvroSchema(structType1, "test_struct_name", "test_namespace")
-    val records1 = Seq(Row("key1", Row("innerKey1_1", 1L), Row("innerKey1_2", 2L)))
-
-    val df1 = spark.createDataFrame(spark.sparkContext.parallelize(records1), structType1)
-    val genRecRDD1 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema1))
-    assert(schema1.equals(genRecRDD1.collect()(0).getSchema))
-
-    // create schema2 which has one addition column at the root level compared to schema1
-    val structType2 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct1,true)
-      .add("nullableInnerStruct2",innerStruct1,true)
-    val schema2 = AvroConversionUtils.convertStructTypeToAvroSchema(structType2, "test_struct_name", "test_namespace")
-    val records2 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L), Row("innerKey2_3", 2L)))
-    val df2 = spark.createDataFrame(spark.sparkContext.parallelize(records2), structType2)
-    val genRecRDD2 = HoodieSparkUtils.createRdd(df2, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema2))
-    assert(schema2.equals(genRecRDD2.collect()(0).getSchema))
-
-    // send records1 with schema2. should succeed since the new column is nullable.
-    val genRecRDD3 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema2))
-    assert(genRecRDD3.collect()(0).getSchema.equals(schema2))
-    genRecRDD3.foreach(entry => assertNull(entry.get("nullableInnerStruct2")))
-
-    val innerStruct3 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-      .add("new_nested_col","string",true)
-
-    // create a schema which has one additional nested column compared to schema1, which is nullable
-    val structType4 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct3,true)
-
-    val schema4 = AvroConversionUtils.convertStructTypeToAvroSchema(structType4, "test_struct_name", "test_namespace")
-    val records4 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L, "new_nested_col_val1")))
-    val df4 = spark.createDataFrame(spark.sparkContext.parallelize(records4), structType4)
-    val genRecRDD4 = HoodieSparkUtils.createRdd(df4, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema4))
-    assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
-
-    // convert batch 1 with schema4. should succeed.
-    val genRecRDD5 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-      org.apache.hudi.common.util.Option.of(schema4))
-    assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
-    val genRec = genRecRDD5.collect()(0)
-    val nestedRec : GenericRecord = genRec.get("nullableInnerStruct").asInstanceOf[GenericRecord]
-    assertNull(nestedRec.get("new_nested_col"))
-    assertNotNull(nestedRec.get("innerKey"))
-    assertNotNull(nestedRec.get("innerValue"))
-
-    val innerStruct4 = new StructType().add("innerKey","string",false).add("innerValue", "long", true)
-      .add("new_nested_col","string",false)
-    // create a schema which has one additional nested column compared to schema1, which is non nullable
-    val structType6 = new StructType().add("key", "string", false)
-      .add("nonNullableInnerStruct",innerStruct1,false).add("nullableInnerStruct",innerStruct4,true)
-
-    val schema6 = AvroConversionUtils.convertStructTypeToAvroSchema(structType6, "test_struct_name", "test_namespace")
-    // convert batch 1 with schema5. should fail since the missed out column is not nullable.
     try {
-      val genRecRDD6 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
-        org.apache.hudi.common.util.Option.of(schema6))
-      genRecRDD6.collect()
-      fail("createRdd should fail, because records don't have a column which is not nullable in the passed in schema")
-    } catch {
-      case e: Exception =>
-        if (HoodieSparkUtils.gteqSpark3_3) {
-          assertTrue(e.getMessage.contains("null value for (non-nullable) string at test_struct_name.nullableInnerStruct[nullableInnerStruct].new_nested_col"))
-        } else {
-          assertTrue(e.getMessage.contains("null of string in field new_nested_col of test_namespace.test_struct_name.nullableInnerStruct of union"))
-        }
+      val innerStruct1 = new StructType().add("innerKey", "string", false).add("innerValue", "long", true)
+      val structType1 = new StructType().add("key", "string", false)
+        .add("nonNullableInnerStruct", innerStruct1, false).add("nullableInnerStruct", innerStruct1, true)
+      val schema1 = AvroConversionUtils.convertStructTypeToAvroSchema(structType1, "test_struct_name", "test_namespace")
+      val records1 = Seq(Row("key1", Row("innerKey1_1", 1L), Row("innerKey1_2", 2L)))
+
+      val df1 = spark.createDataFrame(spark.sparkContext.parallelize(records1), structType1)
+      val genRecRDD1 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
+        org.apache.hudi.common.util.Option.of(schema1))
+      assert(schema1.equals(genRecRDD1.collect()(0).getSchema))
+
+      // create schema2 which has one addition column at the root level compared to schema1
+      val structType2 = new StructType().add("key", "string", false)
+        .add("nonNullableInnerStruct", innerStruct1, false).add("nullableInnerStruct", innerStruct1, true)
+        .add("nullableInnerStruct2", innerStruct1, true)
+      val schema2 = AvroConversionUtils.convertStructTypeToAvroSchema(structType2, "test_struct_name", "test_namespace")
+      val records2 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L), Row("innerKey2_3", 2L)))
+      val df2 = spark.createDataFrame(spark.sparkContext.parallelize(records2), structType2)
+      val genRecRDD2 = HoodieSparkUtils.createRdd(df2, "test_struct_name", "test_namespace", true,
+        org.apache.hudi.common.util.Option.of(schema2))
+      assert(schema2.equals(genRecRDD2.collect()(0).getSchema))
+
+      // send records1 with schema2. should succeed since the new column is nullable.
+      val genRecRDD3 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
+        org.apache.hudi.common.util.Option.of(schema2))
+      assert(genRecRDD3.collect()(0).getSchema.equals(schema2))
+      genRecRDD3.foreach(entry => assertNull(entry.get("nullableInnerStruct2")))
+
+      val innerStruct3 = new StructType().add("innerKey", "string", false).add("innerValue", "long", true)
+        .add("new_nested_col", "string", true)
+
+      // create a schema which has one additional nested column compared to schema1, which is nullable
+      val structType4 = new StructType().add("key", "string", false)
+        .add("nonNullableInnerStruct", innerStruct1, false).add("nullableInnerStruct", innerStruct3, true)
+
+      val schema4 = AvroConversionUtils.convertStructTypeToAvroSchema(structType4, "test_struct_name", "test_namespace")
+      val records4 = Seq(Row("key2", Row("innerKey2_1", 2L), Row("innerKey2_2", 2L, "new_nested_col_val1")))
+      val df4 = spark.createDataFrame(spark.sparkContext.parallelize(records4), structType4)
+      val genRecRDD4 = HoodieSparkUtils.createRdd(df4, "test_struct_name", "test_namespace", true,
+        org.apache.hudi.common.util.Option.of(schema4))
+      assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
+
+      // convert batch 1 with schema4. should succeed.
+      val genRecRDD5 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
+        org.apache.hudi.common.util.Option.of(schema4))
+      assert(schema4.equals(genRecRDD4.collect()(0).getSchema))
+      val genRec = genRecRDD5.collect()(0)
+      val nestedRec: GenericRecord = genRec.get("nullableInnerStruct").asInstanceOf[GenericRecord]
+      assertNull(nestedRec.get("new_nested_col"))
+      assertNotNull(nestedRec.get("innerKey"))
+      assertNotNull(nestedRec.get("innerValue"))
+
+      val innerStruct4 = new StructType().add("innerKey", "string", false).add("innerValue", "long", true)
+        .add("new_nested_col", "string", false)
+      // create a schema which has one additional nested column compared to schema1, which is non nullable
+      val structType6 = new StructType().add("key", "string", false)
+        .add("nonNullableInnerStruct", innerStruct1, false).add("nullableInnerStruct", innerStruct4, true)
+
+      val schema6 = AvroConversionUtils.convertStructTypeToAvroSchema(structType6, "test_struct_name", "test_namespace")
+      // convert batch 1 with schema5. should fail since the missed out column is not nullable.
+      try {
+        val genRecRDD6 = HoodieSparkUtils.createRdd(df1, "test_struct_name", "test_namespace", true,
+          org.apache.hudi.common.util.Option.of(schema6))
+        genRecRDD6.collect()
+        fail("createRdd should fail, because records don't have a column which is not nullable in the passed in schema")
+      } catch {
+        case e: Exception =>

Review Comment:
   This is what I changed it to







[GitHub] [hudi] jonvex commented on a diff in pull request #7974: add validation to HoodieAvroUtils

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on code in PR #7974:
URL: https://github.com/apache/hudi/pull/7974#discussion_r1108939360


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkUtils.scala:
##########
@@ -95,32 +95,36 @@ class TestHoodieSparkUtils {
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate
 
-    val schema = DataSourceTestUtils.getStructTypeExampleSchema

Review Comment:
   Copied and pasted everything into a try/finally block.





[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1434165464

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255",
       "triggerID" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1449280253

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255",
       "triggerID" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15490",
       "triggerID" : "fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15490) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7974: add validation to HoodieAvroUtils

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7974:
URL: https://github.com/apache/hudi/pull/7974#issuecomment-1448692152

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255",
       "triggerID" : "d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d44437d0827c6b55c2bdb0173bf2f5ec06a6aa94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15255) 
   * fb75ad5aa5dfb5c91d8c56f987964c6dd98b666c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

