Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/23 03:51:13 UTC

[GitHub] [hudi] guanziyue edited a comment on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

guanziyue edited a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977


   Hi tandonraghav,
   I did some similar work before; I hope my experience can help you.
   First, as nanash mentioned earlier, the preCombine method may be called in two cases: the first is deduplication during ingestion, and the second is during compaction.
   In the compaction process, we first read the log files, using the schema stored in each log block to construct GenericRecords, and then turn each GenericRecord into a payload. The payloads are put into a map; when we find a duplicate key (yes, the duplicates were ingested in different commits), we call preCombine to merge the two records that share the key. This process is similar to a hash join in Spark. In the end we have a map of payloads in which every key is unique. After that, we read the records from parquet, use the schema the user provided in the config to construct IndexedRecords, and call combineAndGetUpdateValue to merge each payload in the map with the data from parquet. A simplified sketch of the merging loop is below.
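   To make the flow concrete, here is a simplified sketch of that merging loop. It uses Hudi's built-in OverwriteWithLatestAvroPayload as a stand-in payload; the ordering field name "ts" and the direct lookup of `_hoodie_record_key` are assumptions for illustration, and the real logic lives in Hudi's HoodieMergedLogRecordScanner:

   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;

   public class LogMergeSketch {
     // Build a map keyed by record key from log records (already deserialized
     // with the schema stored in the log block), merging duplicates with
     // preCombine, similar to the build side of a hash join.
     static Map<String, OverwriteWithLatestAvroPayload> mergeLogRecords(List<GenericRecord> logRecords) {
       Map<String, OverwriteWithLatestAvroPayload> merged = new HashMap<>();
       for (GenericRecord rec : logRecords) {
         String key = rec.get("_hoodie_record_key").toString();
         OverwriteWithLatestAvroPayload incoming =
             new OverwriteWithLatestAvroPayload(rec, (Comparable) rec.get("ts")); // "ts" is an assumed ordering field
         // Map.merge passes (existingValue, newValue); as in Hudi, the newer
         // record's preCombine is called with the older record as the argument.
         merged.merge(key, incoming, (older, newer) -> newer.preCombine(older));
       }
       return merged;
     }
   }
   ```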
   As you mentioned, the schema may not be available inside preCombine. Could you hold a reference to the GenericRecord's schema as an attribute of the MongoHudiCDCPayload class when the payload is constructed? Then you can use that schema in the preCombine method, as sketched below.
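   For example, here is a minimal sketch of such a payload. The class name, the "ts" ordering field, and the merge rule are illustrative (not your actual MongoHudiCDCPayload); the method signatures match Hudi's HoodieRecordPayload interface around 0.8.x:

   ```java
   import java.io.IOException;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.common.util.Option;

   public class SchemaAwarePayload implements HoodieRecordPayload<SchemaAwarePayload> {

     private final GenericRecord record;
     private final Schema schema; // captured when the payload is constructed

     public SchemaAwarePayload(GenericRecord record) {
       this.record = record;
       // Keep the writer (log block) schema so it is available in preCombine.
       // Note: for Spark serialization, a production payload should store Avro
       // bytes plus the schema, as Hudi's BaseAvroPayload does, since
       // GenericRecord itself is not Java-serializable.
       this.schema = record.getSchema();
     }

     @Override
     public SchemaAwarePayload preCombine(SchemaAwarePayload another) {
       // Hudi passes no schema here, but each payload carries its own, so
       // schema-aware merging is possible. Illustrative rule: prefer the
       // record with the larger assumed "ts" field when both schemas have it.
       if (schema.getField("ts") != null && another.schema.getField("ts") != null) {
         Comparable mine = (Comparable) record.get("ts");
         Comparable theirs = (Comparable) another.record.get("ts");
         return mine.compareTo(theirs) >= 0 ? this : another;
       }
       return this;
     }

     @Override
     public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema readerSchema)
         throws IOException {
       // readerSchema is the schema the user provided in the write config.
       return Option.of(record);
     }

     @Override
     public Option<IndexedRecord> getInsertValue(Schema readerSchema) throws IOException {
       return Option.of(record);
     }
   }
   ```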

