You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "Balaji Varadarajan (Jira)" <ji...@apache.org> on 2020/10/14 00:32:00 UTC

[jira] [Updated] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

     [ https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balaji Varadarajan updated HUDI-1343:
-------------------------------------
    Status: Open  (was: New)

> Add standard schema postprocessor which would rewrite the schema using spark-avro conversion
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-1343
>                 URL: https://issues.apache.org/jira/browse/HUDI-1343
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: DeltaStreamer
>            Reporter: Balaji Varadarajan
>            Priority: Major
>
> When we use Transformer, the final Schema which we use to convert avro record to bytes is auto generated by spark. This could be different (due to the way Avro treats it) from the target schema that is being used to write (as the target schema could be coming from Schema Registry). 
>  
> For example : 
> Schema generated by spark-avro when converting Row to avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>  
> is not compatible with the Avro Schema:
>  
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>  
> Note that the type order is different for individual fields : 
> "type" : [ "null", "string" ], vs  "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with first schema is read using second schema.
>  
> One way to fix is to use configured target schema when generating record bytes but this is not easy without breaking Record payload constructor API used by deltastreamer. 
> The other option is to apply a post-processor on target schema to make it schema consistent with Transformer generated records.
>  
> This ticket is to use the later approach of creating a standard schema post-processor and adding it by default when Transformer is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)