Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/26 17:07:21 UTC

[GitHub] [hudi] alexeykudinkin commented on issue #5107: [SUPPORT] High performance costs of AvroSerializer in Datasource writing

alexeykudinkin commented on issue #5107:
URL: https://github.com/apache/hudi/issues/5107#issuecomment-1079734532


   Thanks for flagging this @YuweiXiao, great catch! 
   
   To summarize the issue: it is unfortunately a very sneaky one, introduced accidentally during the refactoring of the AvroSerializer/Deserializer hierarchy in Hudi.
   
   The crux of the issue is that the converter initializes a new AvroSerializer/Deserializer upon _every_ invocation, because the initialization happens within the returned lambda itself (this also has the side effect of pulling the whole `SparkAdapter` into the closure):
   
   ```
   def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] =
       record => sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
         .deserialize(record)
         .map(_.asInstanceOf[InternalRow])
   ```
   
   Instead, it should have been:
   ```
   def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] = {
     // Create the deserializer once, outside the returned lambda, so it is reused for every record
     val deserializer = sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
     record =>
       deserializer.deserialize(record).map(_.asInstanceOf[InternalRow])
   }
   ```
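
   To make the difference concrete, here is a minimal, Spark-free sketch of the two patterns. `ExpensiveDeserializer`, `perCallConverter` and `hoistedConverter` are hypothetical stand-ins invented for illustration (they are not Hudi or Spark APIs); the counter only records how many times the deserializer gets constructed:

   ```scala
   object ConverterInitDemo {
     // Hypothetical stand-in for a costly-to-build deserializer (not a Hudi/Spark class)
     final class ExpensiveDeserializer {
       ExpensiveDeserializer.constructed += 1 // count every construction
       def deserialize(record: String): Option[String] = Some(record.toUpperCase)
     }
     object ExpensiveDeserializer {
       var constructed: Int = 0
     }

     // Problematic shape: construction happens inside the lambda, i.e. once per record
     def perCallConverter(): String => Option[String] =
       record => new ExpensiveDeserializer().deserialize(record)

     // Intended shape: construction happens once, the lambda only captures the instance
     def hoistedConverter(): String => Option[String] = {
       val deserializer = new ExpensiveDeserializer()
       record => deserializer.deserialize(record)
     }

     def main(args: Array[String]): Unit = {
       val records = Seq("a", "b", "c")

       ExpensiveDeserializer.constructed = 0
       records.foreach(perCallConverter())
       println(s"per-call: ${ExpensiveDeserializer.constructed} constructions") // 3

       ExpensiveDeserializer.constructed = 0
       records.foreach(hoistedConverter())
       println(s"hoisted: ${ExpensiveDeserializer.constructed} constructions") // 1
     }
   }
   ```

   In the hoisted variant the lambda captures only the already-built instance, so in this sketch the construction cost is paid once per converter rather than once per record.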

