Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/14 17:00:38 UTC

[GitHub] [hudi] tandonraghav opened a new issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

tandonraghav opened a new issue #3078:
URL: https://github.com/apache/hudi/issues/3078


   
   **Describe the problem you faced**
   
   We have a stream of partial records (Mongo CDC oplogs) and want to update only the keys present in each partial record, keeping the other values intact.
   
   A sample record from our CDC Kafka topic:
   
   ````{"op":"u","doc_id":"606ffc3c10f9138e2a6b6csdc","shard_id":"shard01","ts":{"$timestamp":{"i":1,"t":1617951883}},"db_name":"test","collection":"Users","o":{"$v":1,"$set":{"key2":"value2"}},"o2":{"_id":{"$oid":"606ffc3c10f9138e2a6b6csdc"}}}````
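   
   To illustrate the merge we expect (the `key1`/`key2` field names are illustrative): if the table already holds ````{"_id":"606ffc3c10f9138e2a6b6csdc","key1":"value1","key2":"old"}```` and the oplog above carries `$set {"key2":"value2"}`, the upsert should produce:
   
   ````{"_id":"606ffc3c10f9138e2a6b6csdc","key1":"value1","key2":"value2"}````
   
   i.e. `key1` stays intact instead of becoming NULL.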
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a custom payload class and configure it via `PAYLOAD_CLASS_OPT_KEY`, as described below.
   2. Push some partial records using a Spark DataFrame and persist them in Hudi (see the sketch after this list).
   3. Evolve the schema: add a new field to the existing schema and persist via a Spark DataFrame.
   4. `combineAndGetUpdateValue` is not called once the schema has evolved, which turns the other values into NULL, since only the partial record is passed in and the combine logic lives in the custom class. This behaviour is not observed when the schema stays constant.
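   
   A rough sketch of step 2, using the standard Hudi write options (the table name, precombine field, class package, and base path are placeholders):
   
   ````
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class HudiPartialUpsert {
       // Upserts one batch of partial CDC records into a MOR table,
       // routing all merges through the custom payload class.
       public static void upsertBatch(Dataset<Row> batch, String basePath) {
           batch.write()
               .format("hudi")
               .option("hoodie.table.name", "users")                          // placeholder
               .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
               .option("hoodie.datasource.write.operation", "upsert")
               .option("hoodie.datasource.write.recordkey.field", "doc_id")   // key in the CDC record
               .option("hoodie.datasource.write.precombine.field", "ts")      // ordering field, placeholder
               .option("hoodie.datasource.write.payload.class",
                       "com.example.MongoHudiCDCPayload")                     // class shown below
               .mode(SaveMode.Append)
               .save(basePath);
       }
   }
   ````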
   
   **Expected behavior**
   
   Irrespective of schema evolution, compaction should always go through the `combineAndGetUpdateValue` of the configured payload class.
   
   **Environment Description**
   
   * Hudi version : 0.9.0-SNAPSHOT
   
   * Spark version : 2.4
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Custom Payload class-
   
    ````
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.avro.HoodieAvroUtils;
    import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
    import org.apache.hudi.common.util.Option;
    
    import java.io.IOException;
    import java.util.List;
    import java.util.Properties;
    
    public class MongoHudiCDCPayload extends OverwriteWithLatestAvroPayload {
    
        public MongoHudiCDCPayload(GenericRecord record, Comparable orderingVal) {
            super(record, orderingVal);
        }
    
        public MongoHudiCDCPayload(Option<GenericRecord> record) {
            super(record);
        }
    
        @Override
        public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties)
                throws IOException {
            // An empty payload signals a delete.
            if (this.recordBytes.length == 0) {
                return Option.empty();
            }
            GenericRecord incomingRecord = HoodieAvroUtils.bytesToAvro(this.recordBytes, schema);
            GenericRecord currentRecord = (GenericRecord) currentValue;
    
            // Overlay only the non-null fields of the partial incoming record
            // onto the record currently on disk; everything else stays intact.
            List<Schema.Field> fields = incomingRecord.getSchema().getFields();
            fields.forEach(field -> {
                Object value = incomingRecord.get(field.name());
                if (value != null) {
                    currentRecord.put(field.name(), value);
                }
            });
            return Option.of(currentRecord);
        }
    }
    ````
   
   **Stacktrace**
   
   (No stacktrace provided.)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] fanaticjo edited a comment on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
fanaticjo edited a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-870828089


   > @fanaticjo As per the comments above, you should also implement the `preCombine` method.
   > Also, did you face the same problem with schema evolution, or something else?
   
   I faced the problem only during schema evolution: when a new column comes in, it is filled as null.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] guanziyue edited a comment on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
guanziyue edited a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977


   Hi tandonraghav,
   I did some similar work before; hope my experience helps.
   First, as n3nash mentioned, the `preCombine` method may be called in two cases: first for dedup during ingestion, and second during compaction.
   In the compaction process, we first read the log files, using the schema stored in each log block to construct a GenericRecord, and turn that GenericRecord into a payload. The payloads are put into a map; when we find a duplicate key (yes, the duplicates were ingested in different commits), we call `preCombine` to combine the two records with the same key. This process is similar to a hash join in Spark. Finally, we get a map of payloads in which every key is unique. After that, we read records from parquet, use the schema the user provided in the config to construct an IndexedRecord, and call `combineAndGetUpdateValue` to merge the payload in the map with the data from parquet.
   As you mentioned, the schema may not be available in `preCombine`. Could you hold a reference to the schema of the GenericRecord, captured when the payload is constructed, as an attribute of MongoHudiCDCPayload? Then you can use that schema in `preCombine`. You may find that Schema in Avro 1.8.2 is not serializable, so marking the attribute as transient is a good idea. However, that can lose the schema during ingestion, since payloads are shuffled there; in that case you can recreate the schema from the `properties` argument of `preCombine` (those props are effectively the Hoodie write config).
   Note that you may not always get the schema from the config, so falling back to the config only when the held schema is null is a good idea.
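   
   A minimal sketch of that suggestion, assuming the `preCombine` overload that receives `Properties` and assuming the write schema travels under the `hoodie.avro.schema` key (both worth verifying against your Hudi version):
   
   ````
   import java.util.Properties;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
   import org.apache.hudi.common.util.Option;
   
   public class MongoHudiCDCPayload extends OverwriteWithLatestAvroPayload {
   
       // Avro 1.8.2 Schema is not serializable, so keep the reference transient;
       // it survives within a task but is lost across shuffles.
       private transient Schema schema;
   
       public MongoHudiCDCPayload(GenericRecord record, Comparable orderingVal) {
           super(record, orderingVal);
           this.schema = record.getSchema();
       }
   
       public MongoHudiCDCPayload(Option<GenericRecord> record) {
           super(record);
           if (record.isPresent()) {
               this.schema = record.get().getSchema();
           }
       }
   
       public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue,
                                                        Properties props) {
           Schema mergeSchema = this.schema;
           if (mergeSchema == null && props != null) {
               // Transient schema lost after a shuffle: rebuild it from the
               // write config ("hoodie.avro.schema" is an assumption).
               String schemaStr = props.getProperty("hoodie.avro.schema");
               if (schemaStr != null) {
                   mergeSchema = new Schema.Parser().parse(schemaStr);
               }
           }
           if (mergeSchema == null) {
               // Schema still unavailable: fall back to default latest-wins.
               return super.preCombine(oldValue);
           }
           // ... merge this.recordBytes and oldValue field-by-field using mergeSchema ...
           return this;
       }
   }
   ````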


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
codope commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-905425366


   @tandonraghav @fanaticjo Can we close this? Are there any pending questions after @n3nash's comments?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-870826823


   @fanaticjo As per the comments above, you should also implement the `preCombine` method.
   Also, did you face the same problem with schema evolution, or something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-864611953


   @n3nash I am not able to resolve this. Can `preCombine` be called even if there are no duplicates in the incoming records?
   Can you please explain with a sample?
   
   I fail to understand how I would combine with the existing record if `combineAndGetUpdateValue` is not called, since nowhere else do I have a reference to the `IndexedRecord currentValue`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-865418114


   @tandonraghav Let me explain in detail.
   
   1. `preCombine` -> This is used in the following code paths: a) in-memory deduping/merging of incoming records, where the logic in `preCombine` decides how 2 records with the same record key are deduped; b) on-disk deduping/merging of incoming records, where the same logic decides how 2 records with the same record key in log files are merged/deduped.
   2. `combineAndGetUpdateValue` -> This is used to merge the in-memory record with the one on disk.
   
   Ideally, you want to keep the merging logic for in-memory vs on-disk the same. Take the following use case: you are ingesting 100 records per batch, and 2 of those records have the same record key. If both arrive in the same batch, you would apply `preCombine` to dedup them, whether in memory or in log files. But if the 2 records arrive in different batches, `combineAndGetUpdateValue` is applied to merge them. If the behavior of the two implementations is not the same, you can get different results.
   
   The reason for keeping these two APIs separate was to provide more flexibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar closed issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
vinothchandar closed issue #3078:
URL: https://github.com/apache/hudi/issues/3078


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghavs commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghavs commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-869211052


   @guanziyue Well, in my testing `combineAndGetUpdateValue` is not getting called at all. I am not worried about how compaction happens for delta records.
   Do you have a reference implementation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-926238426


   Closing due to inactivity


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] fanaticjo commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
fanaticjo commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-869828896


   @n3nash The same problem exists in my pull request as well: https://github.com/apache/hudi/pull/3035
   
   I also wanted a way to solve this. I was thinking that internally, if a new column is added, it could be added to `hoodie.update.keys`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] fanaticjo commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
fanaticjo commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-870833491


   > @fanaticjo OK, but you should also implement `preCombine`, since the logic should be uniform; maybe that is the reason?
   
   Okay, I will implement `preCombine` also.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] guanziyue commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
guanziyue commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977


   Hi tandonraghav,
   I did some similar work before; hope my experience helps.
   First, as n3nash mentioned, the `preCombine` method may be called in two cases: first for dedup during ingestion, and second during compaction.
   In the compaction process, we first read the log files, using the schema stored in each log block to construct a GenericRecord, and turn that GenericRecord into a payload. The payloads are put into a map; when we find a duplicate key (yes, the duplicates were ingested in different commits), we call `preCombine` to combine all records with the same key. This process is similar to a hash join in Spark. Finally, we get a map of payloads in which every key is unique. After that, we read records from parquet, use the schema the user provided in the config to construct an IndexedRecord, and call `combineAndGetUpdateValue` to merge the payload in the map with the data from parquet.
   As you mentioned, the schema may not be available in `preCombine`. Could you hold a reference to the schema of the GenericRecord, captured when the payload is constructed, as an attribute of MongoHudiCDCPayload? Then you can use that schema in `preCombine`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-862008843


   @tandonraghav To get the expected payload semantics during compaction, you need to override `preCombine` and keep the same implementation across `combineAndGetUpdateValue` and `preCombine`. The idea is that whether it is 2 records in memory, 2 records on disk, or 1 record in memory vs 1 on disk, we call either `preCombine` or `combineAndGetUpdateValue` depending on what can be constructed as a payload.
   
   Please try overriding `preCombine` and let me know if you still see problems. We might add documentation around this to make it clearer.
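   
   For the payload in this issue, such an override might look roughly like this — a sketch inside MongoHudiCDCPayload, reusing its existing imports, and assuming a `schema` reference is reachable from the payload (e.g. a transient field captured at construction):
   
   ````
   // Apply the same non-null-field overlay that combineAndGetUpdateValue uses,
   // so merges of two in-flight/log records agree with memory-vs-disk merges.
   @Override
   public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
       try {
           // All payloads are MongoHudiCDCPayload in practice, so the cast is safe.
           byte[] oldBytes = ((MongoHudiCDCPayload) oldValue).recordBytes;
           GenericRecord older = HoodieAvroUtils.bytesToAvro(oldBytes, schema);
           GenericRecord newer = HoodieAvroUtils.bytesToAvro(this.recordBytes, schema);
           for (Schema.Field field : newer.getSchema().getFields()) {
               Object value = newer.get(field.name());
               if (value != null) {
                   older.put(field.name(), value);   // newer non-null fields win
               }
           }
           return new MongoHudiCDCPayload(older, this.orderingVal);
       } catch (IOException e) {
           // Cannot deserialize (e.g. schema unavailable): keep default latest-wins.
           return super.preCombine(oldValue);
       }
   }
   ````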


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-869211898


   @guanziyue Well, in my testing `combineAndGetUpdateValue` is not getting called at all. I am not worried about how compaction happens for delta records.
   Do you have a reference implementation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-870831328


   @fanaticjo OK, but you should also implement `preCombine`, since the logic should be uniform; maybe that is the reason?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-869213421


   @guanziyue Thanks for the detailed explanation.
   
   @tandonraghav During compaction, `combineAndGetUpdateValue` is never called. See this to get an understanding of how compaction works -> https://github.com/apache/hudi/blob/e99a6b031bf4f2e3037d4cb5307d443cda2d2002/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java#L132. 
   
   The issue is that you want a common implementation across both `preCombine` and `combineAndGetUpdateValue`. The way I would recommend doing this is as follows:
   
   ````
   class YourPayloadImplementation {
   
       // All the common values needed to determine how to merge 2 records.
       Map<String, String> mergeValues;
   
       private boolean commonMergeLogic(Map<String, String> oldValues, Map<String, String> newValues) {
           // <your common implementation>
       }
   
       // Merge 2 different records using the common values.
       public YourPayloadImplementation preCombine(YourPayloadImplementation that) {
           boolean pickNew = commonMergeLogic(this.mergeValues, that.mergeValues);
           if (pickNew) {
               // ...
           }
           // ...
       }
   
       public Option<IndexedRecord> combineAndGetUpdateValue(GenericRecord old, Schema schema) {
           // Build the common values from the old record and the new (insert) value.
           Map<String, String> oldValues = extractValuesFromOld(old);
           Map<String, String> newValues = extractValuesFromNew(getInsertValue(schema));
           boolean pickNew = commonMergeLogic(oldValues, newValues);
           if (pickNew) {
               // ...
           }
           // ...
       }
   }
   ````
   
   Another way is to pass the Schema through the properties and convert everything to a GenericRecord; then there is no need for the intermediate map of common values. Your methods would look as follows:
   
   ````
   class YourPayloadImplementation {
   
       private boolean commonMergeLogic(GenericRecord oldRecord, GenericRecord newRecord) {
           // <your common implementation>
       }
   
       public YourPayloadImplementation preCombine(YourPayloadImplementation that, Properties p) {
           Schema schema = new Schema.Parser().parse(p.getProperty("schema"));
           boolean pickNew = commonMergeLogic(getInsertValue(schema), that.getData().getInsertValue(schema));
           if (pickNew) {
               // ...
           }
           // ...
       }
   
       public Option<IndexedRecord> combineAndGetUpdateValue(GenericRecord old, Schema schema) {
           boolean pickNew = commonMergeLogic(old, getInsertValue(schema));
           if (pickNew) {
               // ...
           }
           // ...
       }
   }
   ````
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav edited a comment on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav edited a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866269573


   @n3nash I am trying to implement exactly https://github.com/apache/hudi/pull/2106/files ; in that PR, the `PartialAvroPayload` class also only implements `combineAndGetUpdateValue`.
   Also, I do not have the `Schema` in the `preCombine` method.
   Also, what I am saying is: if the record pre-exists in parquet and I add a new field in the incoming records (log file .avro files), then on compaction `combineAndGetUpdateValue` is not called at all, but this does not happen if the schema is unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-862168089


   @n3nash I thought `preCombine` was only used to dedup incoming records.
   Then why does `OverwriteNonDefaultsWithLatestAvroPayload` not override `preCombine`?
   The same goes for this [PR](https://github.com/apache/hudi/pull/3035/files).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghav commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghav commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866269573


   @n3nash I am trying to implement exactly https://github.com/apache/hudi/pull/2106/files ; in that PR, the `PartialAvroPayload` class also only implements `combineAndGetUpdateValue`.
   Also, I do not have the `Schema` in the `preCombine` method.
   Also, what I am saying is: if the record pre-exists in parquet and I add the field in the incoming records (log file .avro files), then on compaction `combineAndGetUpdateValue` is not called at all, but this does not happen if the schema is unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-918431721


   @tandonraghav @fanaticjo : we have some work going on w.r.t. schema evolution in the community. If you have any pending issues, do let us know; we can definitely look into fixing them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] tandonraghavs removed a comment on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
tandonraghavs removed a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-869211052


   @guanziyue Well, in my testing `combineAndGetUpdateValue` is not getting called at all. I am not worried about how compaction happens for delta records.
   Do you have a reference implementation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] fanaticjo commented on issue #3078: [SUPPORT] combineAndGetUpdateValue is not getting called when Schema evolution happens

Posted by GitBox <gi...@apache.org>.
fanaticjo commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-870828089


   > @fanaticjo As per the comments above, you should also implement the `preCombine` method.
   > Also, did you face the same problem with schema evolution, or something else?
   
   I faced the problem only during schema evolution: when a new column comes in, it is filled as null.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org