You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/08 19:44:00 UTC

[GitHub] [hudi] n3nash commented on pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

n3nash commented on pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#issuecomment-835486764


   At a high level, my understanding is there are 3 issues
   
   1. Removing namespace could cause backwards compatibility -> Validated with [this](https://github.com/apache/hudi/pull/2928) PR that it should be OK to make this change. 
   2. Records with older schema received without a user defined schema. This breaks the `getInsertRecord` API since the writer schema is smaller than the reader schema and bytes to generic record conversion breaks, addressed by @nsivabalan [diff](https://github.com/apache/hudi/pull/2927)
   3. For general schema evolution to work when converting from DataFrame to Avro, we need to ensure that new elements added in the schema have a default value "null" using the UNION type as follows `name: a, type ["null", <type>]`. Spark's internal converter breaks the Avro spec of creating a UNION by reversing the order of null as follows `name: a, type [<type>, "null"]` , @nsivabalan fixed this with [this](https://github.com/apache/hudi/pull/2765) PR by relying on Hudi's internal ConversionHelper to ensure that "null" is appended first. 
   
   With all of these 3 fixes, we should be able to land this PR. @sathyaprakashg Let us know if there were any other concerns that you came across. 
   
   @nsivabalan I can confirm that (3) should be good to land. I've approved the diff and @bvaradar will be taking a look at this later today, once he confirms, we can land both (3) and then (2) with the assumption that (1) validates both. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org