Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/25 10:48:47 UTC

[GitHub] [hudi] YannByron opened a new pull request, #6788: [HUDI-4915] improve avro serializer/deserializer

YannByron opened a new pull request, #6788:
URL: https://github.com/apache/hudi/pull/6788

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257204147

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257172243

   @YannByron please put details in the JIRA and PR description properly.




[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257217923

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] YannByron commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
YannByron commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1258869683

   @alexeykudinkin yep, I forgot to add a UT for this.
   
   Here is a case that triggers the bug I mentioned (modified from TestAvroSerDe):
   
   ```
     def testAvroUnionSerDe(): Unit = {
       val originalAvroRecord = {
         val minValue = new GenericData.Record(IntWrapper.SCHEMA$)
         minValue.put("value", 9)
         val maxValue = new GenericData.Record(IntWrapper.SCHEMA$)
         maxValue.put("value", 10)
   
         val record = new GenericData.Record(HoodieMetadataColumnStats.SCHEMA$)
         record.put("fileName", "9388c460-4ace-4274-9a0b-d44606af60af-0_2-25-35_20220520154514641.parquet")
         record.put("columnName", "c8")
         record.put("minValue", minValue)
         record.put("maxValue", maxValue)
         record.put("valueCount", 10L)
         record.put("nullCount", 0L)
         record.put("totalSize", 94L)
         record.put("totalUncompressedSize", 54L)
         record.put("isDeleted", false)
         record
       }
       val originalAvroRecord2 = {
         val minValue = new GenericData.Record(IntWrapper.SCHEMA$)
         minValue.put("value", 9)
         val maxValue = new GenericData.Record(IntWrapper.SCHEMA$)
         maxValue.put("value", 10)
   
         val record = new GenericData.Record(HoodieMetadataColumnStats.SCHEMA$)
         record.put("fileName", "9388c460-4ace-4274-9a0b-d44606af60af-0_2-25-35_20220520154514641.parquet")
         record.put("columnName", "c8")
         record.put("minValue", minValue)
         record.put("maxValue", maxValue)
         record.put("valueCount", 10L)
         record.put("nullCount", 0L)
         record.put("totalSize", 94L)
         record.put("totalUncompressedSize", 55L) // only change this field.
         record.put("isDeleted", false)
         record
       }
   
       val avroSchema = HoodieMetadataColumnStats.SCHEMA$
       val SchemaType(catalystSchema, _) = SchemaConverters.toSqlType(avroSchema)
   
       val deserializer = sparkAdapter.createAvroDeserializer(avroSchema, catalystSchema)
       val serializer = sparkAdapter.createAvroSerializer(catalystSchema, avroSchema, nullable = false)
   
       val row = deserializer.deserialize(originalAvroRecord).get
       val row2 = deserializer.deserialize(originalAvroRecord2).get // deserialize originalAvroRecord2
       assert(row != row2) // without this pr, row and row2 are the same object.
       val deserializedAvroRecord = serializer.serialize(row)
   
       assertEquals(originalAvroRecord, deserializedAvroRecord)
     }
   ```




[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257218654

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257178088

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257176339

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1258756038

   @YannByron can you please pinpoint in the code where you think the issue is?
   
   Also, if you're addressing an issue can we please add a regression test to make sure it's actually addressed?




[GitHub] [hudi] hudi-bot commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1257220355

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713",
       "triggerID" : "0c3c94864a6523f0a72156cae3be1a3a70e54ef3",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0c3c94864a6523f0a72156cae3be1a3a70e54ef3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xushiyan merged pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #6788:
URL: https://github.com/apache/hudi/pull/6788




[GitHub] [hudi] alexeykudinkin commented on pull request #6788: [HUDI-4915] improve avro serializer/deserializer

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6788:
URL: https://github.com/apache/hudi/pull/6788#issuecomment-1258943919

   @YannByron that's exactly the problem: 
   ```
   val row = deserializer.deserialize(originalAvroRecord).get
   val row2 = deserializer.deserialize(originalAvroRecord2).get // deserialize originalAvroRecord2
   assert(row != row2) // without this pr, row and row2 are the same object.
   ```
   
   You're retaining a reference to an internal Row object returned by `AvroDeserializer`. Spark is unequivocally clear about this: if you want to retain a reference to a row, you _have to_ copy it, since the row might be stored in a reusable buffer. Without copying, you might be holding a reference to a buffer that will be overwritten by a subsequent invocation.
   
   Unfortunately, this is a mistake that is easy to make with Spark but hard to trace. We just need to be vigilant about it.
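
The buffer-reuse pitfall described above can be illustrated with a minimal, Spark-free sketch. Everything in it is hypothetical and exists only for illustration: `MutableRow` stands in for a mutable row, and `ReusingDeserializer` stands in for a deserializer that writes each result into one shared buffer. It is not Hudi's or Spark's actual implementation. In real Spark code the analogous step would presumably be calling `copy()` on the returned row before retaining it.

```scala
// Hypothetical stand-in for a mutable row; NOT Spark's InternalRow.
final class MutableRow(var totalUncompressedSize: Long) {
  // Returns an independent snapshot of the current contents.
  def copy(): MutableRow = new MutableRow(totalUncompressedSize)
}

// Hypothetical deserializer that writes every result into one reusable buffer.
final class ReusingDeserializer {
  private val buffer = new MutableRow(0L)

  def deserialize(value: Long): MutableRow = {
    buffer.totalUncompressedSize = value
    buffer // the same instance is returned on every call
  }
}

object RowReuseDemo {
  // Retains the returned references directly: both names alias the shared buffer.
  def withoutCopy(): (MutableRow, MutableRow) = {
    val deserializer = new ReusingDeserializer
    val row = deserializer.deserialize(54L)
    val row2 = deserializer.deserialize(55L) // overwrites what `row` points at
    (row, row2)
  }

  // Copies each result before retaining it: the two rows stay independent.
  def withCopy(): (MutableRow, MutableRow) = {
    val deserializer = new ReusingDeserializer
    val row = deserializer.deserialize(54L).copy()
    val row2 = deserializer.deserialize(55L).copy()
    (row, row2)
  }
}
```

In this sketch, `withoutCopy()` yields two references to the same object, so the first value is silently lost, while `withCopy()` preserves both values. The cost of copying is an extra allocation per retained row, which is the trade-off a reusable buffer is meant to avoid when rows are consumed immediately.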

