Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/07 23:03:59 UTC

[GitHub] [hudi] sathyaprakashg commented on a change in pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

sathyaprakashg commented on a change in pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#discussion_r519230640



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##########
@@ -364,4 +366,40 @@ object AvroConversionHelper {
         }
     }
   }
+
+  /**
+   * Remove namespace from fixed field.
+   * org.apache.spark.sql.avro.SchemaConverters.toAvroType method adds namespace to fixed avro field
+   * https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L177
+   * So we need to remove that namespace so that a reader schema without the namespace does not throw an error like this one
+   * org.apache.avro.AvroTypeException: Found hoodie.source.hoodie_source.height.fixed, expecting fixed
+   *
+   * @param schema Schema from which namespace needs to be removed for fixed fields
+   * @return input schema with namespace removed for fixed fields, if any
+   */
+  def removeNamespaceFromFixedFields(schema: Schema): Schema  ={

Review comment:
       @n3nash @bvaradar I checked the three steps you mentioned, and they work fine when the reader and writer schemas have the same set of fields (and the writer schema has a namespace on the fixed field).
   
   If the reader schema has an extra field, this approach does not work. Here is an [example](https://gist.github.com/sathyaprakashg/f423291be7be6f9d96b9cb850fc72edf) in which the reader schema has an extra field and the read fails with an error. When the schema evolves, the table schema (reader schema) may have more or fewer fields than the writer schema (the MOR log file schema). So if we implement this approach, it would work only when the two schemas are the same (apart from the extra namespace information in the writer schema). Please let me know how to handle this, or correct me if the approach I took is wrong.
   
   Just to recap, the issue we are trying to solve is this: in the existing code, when we write a fixed Avro field to a MOR log file, it is written with extra namespace information in one of the flows (Transformation without userProvidedSchema) but not in the other two. With this PR, the extra namespace information will no longer be written.
   
   Since this extra namespace information is written only to MOR log files and not to parquet files, one possible solution is for users to run compaction before running their job with the upgraded version of Hudi. Compaction is not mandatory for the upgrade; it is needed only if the schema has a fixed field and the Transformation without userProvidedSchema flow was in use.
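
For readers following the thread, here is a minimal sketch of what "removing the namespace from fixed fields" can look like with the Avro Java API. It is an illustration only, not the code in this PR: the object name `FixedNamespaceSketch` and the method name `stripFixedNamespace` are hypothetical, it assumes Avro 1.8 or later on the classpath, and it does not handle recursive record schemas or copy field aliases and custom properties.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical helper for illustration; not the implementation in the PR.
object FixedNamespaceSketch {

  // Returns a copy of `schema` in which every FIXED type has no namespace.
  // Assumption: the schema is not self-referencing (recursive records would loop).
  def stripFixedNamespace(schema: Schema): Schema = schema.getType match {
    case Schema.Type.FIXED =>
      // Rebuild the fixed type with the same short name and size, but a null namespace.
      val fixed = Schema.createFixed(schema.getName, schema.getDoc, null, schema.getFixedSize)
      // Carry over a logical type (for example decimal) if one is attached.
      Option(schema.getLogicalType).foreach(_.addToSchema(fixed))
      fixed

    case Schema.Type.RECORD =>
      // Field objects cannot be reused across records, so create new ones.
      val fields = schema.getFields.asScala.map { f =>
        new Schema.Field(f.name, stripFixedNamespace(f.schema), f.doc, f.defaultVal())
      }
      val copy = Schema.createRecord(schema.getName, schema.getDoc, schema.getNamespace, schema.isError)
      copy.setFields(fields.asJava)
      copy

    case Schema.Type.UNION =>
      Schema.createUnion(schema.getTypes.asScala.map(stripFixedNamespace).asJava)

    case Schema.Type.ARRAY =>
      Schema.createArray(stripFixedNamespace(schema.getElementType))

    case Schema.Type.MAP =>
      Schema.createMap(stripFixedNamespace(schema.getValueType))

    // Primitives and enums are returned unchanged.
    case _ => schema
  }
}
```

Applied to a record whose `height` field is a fixed type named `hoodie.source.hoodie_source.height.fixed`, the returned schema keeps the record's own namespace while the fixed type's full name becomes just `fixed`, which is the name the reader schema in the error message above expects.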
   
   



