Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/03 14:20:28 UTC

[GitHub] [hudi] dmenin opened a new issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

dmenin opened a new issue #3394:
URL: https://github.com/apache/hudi/issues/3394


   Hello everyone.
   
   I have a quick question about hudi’s default behavior.
   I want to understand how UPSERT works for the same key in different scenarios.
   I am using ‘GLOBAL_SIMPLE’ index, which, from my understanding, tries to enforce uniqueness across all the partitions.
   
   The scenario is really straightforward: based on a timestamp, I want new data to be upserted and old data to be ignored. 
   The data on disk (S3) is partitioned by year\month\day, so there are basically 4 scenarios:
   
   1) Inserting NEW data on the same partition
   2) Inserting NEW data on a different partition
   3) Inserting OLD data on the same partition
   4) Inserting OLD data on a different partition
   
   
    
   Below is the result of the test on these scenarios.
   There is only one row with 4 columns:
   two keys (composite - always 100, 100)
   one description
   one timestamp (it becomes the partition and it's the sort key)
   
   
   Under “DB” you see the row that was on the database (the current state of the database);
   Under “Row In” you can see the row that was read from the file and issued to the insert statement and 
   under “Result” you see the result of the database after the insert.
   
   There are no headers, but the first two numbers (100 and 100) are the composite key, the string is the description, and the datetime is the date of the row, which is converted to an integer (epoch) and used as the field for both "hoodie.datasource.write.precombine.field" and "hoodie.payload.ordering.field".
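For readers following along, the timestamp handling described above could look roughly like the sketch below. The helper names `to_epoch` and `partition_path` are hypothetical, and treating the timestamps as UTC and rendering the partition as `YYYY/MM/DD` are assumptions; the attached script may do this differently.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch(ts: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and return whole seconds since the epoch."""
    dt = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def partition_path(ts: str) -> str:
    """Derive a year/month/day partition path from the same timestamp string."""
    return datetime.strptime(ts, TS_FORMAT).strftime("%Y/%m/%d")


# The incoming row from case 4 is older than the stored row and lands in an earlier partition.
stored_ts = "2021-06-24 10:01:00"
incoming_ts = "2021-06-23 09:00:00"
print(to_epoch(incoming_ts) < to_epoch(stored_ts))
print(partition_path(stored_ts), partition_path(incoming_ts))
```

The integer returned by `to_epoch` is the kind of value that would be used as the ordering column, while `partition_path` decides which partition the row is written to.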
    
    
   As you can see below, cases 1 and 2, which deal with NEWER data, update the row with the new data - this is expected.
   
   Case 3 does not update the row with the “older data” - see that the record on the DB was from 10AM and the new record was for 8AM - this is great, this is also the behavior I want.
    
   But in case 4, if I try to upsert older data that belongs to an OLDER partition, it updates the row. This is weird; I would expect cases 3 and 4 to behave the same.
   
   Why does the partition of the data determine whether the data is updated or not?
   Why did scenario 4 DELETE the data from partition 24 and insert it into partition 23? I mean, it's great that hudi only kept one copy of each key, but why do scenarios 3 and 4 behave differently?
   This is all running in AWS Glue with hudi 0.7.
    
   CASE 1 - Inserting NEW data on the same partition
   DB:
   100  100  three  2021-06-23 10:00:00
   Row In:
   100  100  same partition  2021-06-23 10:01:00
   Result (OK):
   100  100  same partition  2021-06-23 10:01:00
    
    
    
   CASE 2 - Inserting NEW data on different partition:
   DB:
   100  100  same partition  2021-06-23 10:01:00
   Row In:
   100  100  dif partition  2021-06-24 10:01:00
   Result (OK):
   100  100  dif partition  2021-06-24 10:01:00
    
    
    
   CASE 3 - Inserting OLD data on the same partition
   DB:
   100  100  dif partition  2021-06-24 10:01:00
   Row In:
   100  100  old data same partition  2021-06-24 08:00:00
   Result (OK):
   100  100  dif partition  2021-06-24 10:01:00
    
    
    
   CASE 4 - Inserting OLD data on different partition
   DB:
   100  100  dif partition  2021-06-24 10:01:00
   Row In:
   100  100  old data dif partition  2021-06-23 09:00:00
   Result (BAD):
   100  100  old data dif partition  2021-06-23 09:00:00
    
   
   I am attaching the code that I am using.
   Any help would be greatly appreciated.
   Thanks very much
   
   [hudisample.txt](https://github.com/apache/hudi/files/6924753/hudisample.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] dmenin commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
dmenin commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-895381894


   ok, so bottom line, hudi doesn't have the concept of OLDER and NEWER in terms of row date (timestamp) - it only has NEW and OLD partitions (where NEW corresponds to the partition of the data being upserted and OLD corresponds to the EXISTING partition of a particular key).
   If I want the behaviour I described, I probably have to implement it myself? Have you come across this use case, and can you suggest a solution? (The simplest one I can imagine is to manually delete the data that's obsolete and only insert the new data - but to do that, I have to join the incoming data with the existing data and check the differences... which may not perform well in the long term.)
   
   Thanks for your help so far.
   Diego
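The workaround described above (compare incoming rows with what is already stored and only upsert the ones that are not older) can be sketched in plain Python as follows. This is an illustration only: `drop_stale` and the row shape are hypothetical, and in a Spark job the same idea would be a left join between the incoming DataFrame and a key/timestamp projection of the table.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # composite record key, e.g. (100, 100)


def drop_stale(existing_ts: Dict[Key, str], incoming: List[dict]) -> List[dict]:
    """Keep only incoming rows that are new keys or not older than the stored row.

    Timestamps are fixed-width 'YYYY-MM-DD HH:MM:SS' strings, so plain string
    comparison orders them chronologically.
    """
    kept = []
    for row in incoming:
        current = existing_ts.get(row["key"])
        if current is None or row["ts"] >= current:
            kept.append(row)
    return kept


existing = {(100, 100): "2021-06-24 10:01:00"}
batch = [
    {"key": (100, 100), "desc": "old data dif partition", "ts": "2021-06-23 09:00:00"},
    {"key": (200, 200), "desc": "brand new key", "ts": "2021-06-23 09:00:00"},
]
print(drop_stale(existing, batch))
```

Only the rows returned by `drop_stale` would then be passed to the upsert, so the stale row from case 4 never reaches the table. The cost is the extra read of existing keys on every batch, which is the performance concern raised above.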





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-974819625


   Can you try setting `hoodie.datasource.write.precombine.field`? It should also get applied to `hoodie.payload.ordering.field`. 





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-895327882


   yes, you are right. 
   rec1, pp1, v2, pc2 
   // here v2, pc2 represents the updated value. If not updated, it would have been v1, pc1
   





[GitHub] [hudi] maddy2u commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
maddy2u commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-926588291


   All good, thanks Vinoth for your and Sivabalan's support! It can be closed.





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-892056474


   yeah, with global index use-case, especially when there is a clash between two records just wrt partition path, depending on the [config value](https://hudi.apache.org/docs/configurations#bloomindexupdatepartitionpathupdatepartitionpath--false) set, hudi does either of these two. 
   a. delete existing storage record in old partition and insert to new partition
   or 
   b. write the incoming record as an update to the same old partition (ignoring the new partition).
   
   Here hudi does not honor preCombine. 
   





[GitHub] [hudi] dmenin commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
dmenin commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-892446743


   Hi Sivabalan,
   Thanks very much for taking the time to reply.
   A few questions:
   
   1) The config value you linked seems to apply only to the GLOBAL_BLOOM index. I am using GLOBAL_SIMPLE, so I don’t think it applies to my case.
   
   2) You mentioned “preCombine”. My understanding is that preCombine works before the write: if I have two records with the same key, preCombine will choose the one with the largest value and then “submit” the insert command, so it shouldn’t affect the comparison between input data and existing data, correct? Since we only have 1 new row in the insert, it seems that preCombine is also not relevant in this case.
   
   Could you clarify further, please?
   Thanks,
   Diego
   
   





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-895408837


   There are some nuances here. Ignore the global index and the different partitions for now. Just consider how to reconcile two records in general. 
   
   I guess you know what preCombine is used for (it is used to combine two records within the same incoming batch of writes). 
   But to reconcile an incoming record with one already in storage, Hudi relies on HoodieRecordPayload.combineAndGetUpdateValue(). 
   
   The most commonly used payload impl is OverwriteWithLatestAvroPayload. This will always choose the latest incoming record over what's in storage. 
   
   But recently we also added another payload impl called DefaultHoodieRecordPayload. This payload honors the preCombine field value (within combineAndGetUpdateValue()) while reconciling an incoming record with what's in storage. 
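The difference between the two payload behaviours described above can be illustrated with a toy model. This is not Hudi code: the function names, the dict-based record shape and the `ts` ordering field are all hypothetical, and letting the incoming record win on ties is an assumption of this sketch.

```python
from typing import Optional

# Hypothetical record shape for this sketch: {"key": ..., "desc": ..., "ts": ...}
Record = dict


def overwrite_with_latest(stored: Optional[Record], incoming: Record) -> Record:
    """Toy model of the OverwriteWithLatestAvroPayload behaviour: incoming always wins."""
    return incoming


def ordering_aware(stored: Optional[Record], incoming: Record,
                   ordering_field: str = "ts") -> Record:
    """Toy model of the DefaultHoodieRecordPayload behaviour: keep the stored record
    when its ordering value is strictly larger than the incoming one."""
    if stored is not None and stored[ordering_field] > incoming[ordering_field]:
        return stored
    return incoming


# Stand-in ordering values: the stored row is "newer" than the incoming one.
stored = {"key": (100, 100), "desc": "stored row", "ts": 2000}
incoming = {"key": (100, 100), "desc": "late-arriving row", "ts": 1000}

print(overwrite_with_latest(stored, incoming)["desc"])
print(ordering_aware(stored, incoming)["desc"])
```

In this model the first function replaces the stored row with the late-arriving one, while the second keeps the stored row because its ordering value is larger.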
   





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-991913326


   if you are good, can we close the issue out please. 





[GitHub] [hudi] nsivabalan edited a comment on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-892056474


   yeah, with global index use-case, especially when there is a clash between two records just wrt partition path, depending on the [config value](https://hudi.apache.org/docs/configurations#bloomindexupdatepartitionpathupdatepartitionpath--false) set, hudi does either of these two. 
   a. delete existing storage record in old partition and insert to new partition
   or 
   b. write the incoming record as an update to the same old partition (ignoring the new partition).
   
   In this flow hudi does not honor preCombine. 
   PreCombine will be honored when an update happens, e.g. (b) in the above scenario, or just regular updates where both record key and partition path match.
   





[GitHub] [hudi] nsivabalan edited a comment on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-895408837


   There are some nuances here. Ignore the global index and the different partitions for now. Just consider how to reconcile two records in general (in other words, there is only one partition, and an update record is written to this partition where the record already exists in storage).
   
   I guess you know what preCombine is used for (it is used to combine two records within the same incoming batch of writes). 
   But to reconcile an incoming record with one already in storage, Hudi relies on HoodieRecordPayload.combineAndGetUpdateValue(). 
   
   The most commonly used payload impl is OverwriteWithLatestAvroPayload. This will always choose the latest incoming record over what's in storage. 
   
   But recently we also added another payload impl called DefaultHoodieRecordPayload. This payload honors the preCombine field value (within combineAndGetUpdateValue()) while reconciling an incoming record with what's in storage. 
   





[GitHub] [hudi] maddy2u commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
maddy2u commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-905846390


   Hi Sivabalan,
   
   I work with Diego on this topic.
   
   1. We use Hudi 0.7 for processing and storing our data in Hudi format. Based on what you mentioned, my understanding is that the statement below would not be applicable to this version of Hudi. Is it available in 0.8, or please correct my assumption? How do we enable the use of the precombine field while reconciling an incoming record? Are there any edge scenarios that we must be aware of?
   
   > But recently we also added another payload impl called DefaultHoodieRecordPayload. This payload will honor preCombine field while reconciling an incoming record with whats in storage using the preCombine field value(within combineAndGetUpdateValue()).
   
   
   Summarizing the discussion from this thread - 
   
   1. Hudi will always treat the new data coming in as the data that needs to overwrite. The data is always updated based on the new data that is coming in (implemented in OverwriteWithLatestAvroPayload)
   2. Depending on hoodie.simple.index.update.partition.path = true/false, the data will be updated in the old or new partitions.
   
   
   





[GitHub] [hudi] dmenin commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
dmenin commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-895017446


   Yes it does, thanks for clarifying.
   
   I guess the confusion (mainly on my part) was that when I said NEW and OLD, I was referring to the timestamp of the row (translated to the partition). For example, since my data is partitioned by year\month\day, a record on partition 23 is OLDER than one on partition 24 - and my use case is that, if a record from 23 is submitted to hudi when the same record exists on partition 24, it should be ignored - which didn't happen in case 4 above - but now I understand why. I have "hoodie.simple.index.update.partition.path = true", which honoured the new partition.
   
   Just to confirm, in your first example (with hoodie.simple.index.update.partition.path = false), the partition path is ignored but the data is updated, correct?
   
   Thanks,
   Diego
   
   





[GitHub] [hudi] ChaladiMohanVamsi commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
ChaladiMohanVamsi commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-926697926


   @nsivabalan 
   > But recently we also added another payload impl called DefaultHoodieRecordPayload. This payload will honor preCombine field while reconciling an incoming record with whats in storage using the preCombine field value(within combineAndGetUpdateValue()).
   
   I have a confusion along similar lines. Can you please clarify and correct my understanding?
   
   How do the following configs differ in DefaultHoodieRecordPayload, and which one does it use to select the record?
   
   1. hoodie.payload.ordering.field
   2. hoodie.datasource.write.precombine.field
   
   With the same payload class, is it possible to disable precombine when deduplicating within the same incremental batch, but still let it decide whether or not to update an existing record?
   
   I tried the DefaultHoodieRecordPayload class with
   hoodie.combine.before.insert=false, not providing a precombine field but setting payload.ordering.field.
   
   In this scenario it threw an error about a missing precombine field column.





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-905187503


   @dmenin : hey, let me know if you need any more info. I will wait for a couple of days and close this out if we don't hear back. 





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-894482014


   1. sorry, looks like we forgot to update our config page. 
   "hoodie.simple.index.update.partition.path" is the one for simple index. 
   
   2. Let me try to illustrate w/ simple example.
   
   Format: 
   record key, partition path, col1, preCombine
   
   insert:
   rec1, pp1, v1, pc1
   rec2, pp2, v1, pc1
   
   both records will be inserted into hudi table. 
   data in hudi table
   rec1, pp1, v1, pc1
   rec2, pp2, v1, pc1
   
   Now, let's see what happens if some overlapping records are ingested with hoodie.simple.index.update.partition.path = false. Records will always be routed to the old partition if found in the hudi table. 
   
   new writes:
   rec1, pp2, v2, pc2
   rec3, pp2, v2, pc2
   
   Once committed, this is what data in hudi table looks like
   
   rec1, pp1, v2, pc2 // new partition path ignored. 
   rec2, pp2, v1, pc1
   rec3, pp2, v2, pc2
   
   
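   Now, let's see what happens if some overlapping records are ingested with hoodie.simple.index.update.partition.path = true. If a record is found in the hudi table under a different partition, the existing copy is deleted from the old partition and the incoming record is written to the new partition. 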
   
   data in hudi table
   rec1, pp1, v1, pc1
   rec2, pp2, v1, pc1
   
   new writes:
   rec1, pp2, v2, pc2
   rec3, pp2, v2, pc2
   
   Once committed, this is what data in hudi table looks like
   
   rec1, pp2, v2, pc2 // new partition path honored. 
   rec1, pp1, v1, pc1 : deleted.  
   rec2, pp2, v1, pc1
   rec3, pp2, v2, pc2
   
   Bottom line: with a global index type, record keys are unique across the entire data set (irrespective of partition path).
   
   Let me know if this is clear. 
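The two walkthroughs above can be replayed with a small toy model. It is a sketch only, not Hudi's implementation: `global_upsert` is a hypothetical helper, the table is a plain dict keyed by record key (which is what makes keys unique across partitions), and the incoming value always overwrites the stored one, as in the example.

```python
from typing import Dict, List, Tuple

# record key -> (partition path, col1, preCombine value)
Table = Dict[str, Tuple[str, str, str]]
# (record key, partition path, col1, preCombine value)
Write = Tuple[str, str, str, str]


def global_upsert(table: Table, writes: List[Write], update_partition_path: bool) -> Table:
    """Apply writes to a copy of the table under global-index semantics."""
    result = dict(table)
    for key, partition, value, pc in writes:
        if key in result and not update_partition_path:
            # Route the update to the partition where the key already lives.
            partition = result[key][0]
        # Otherwise the key moves to (or is created in) the incoming partition;
        # the old copy disappears because the dict holds one entry per key.
        result[key] = (partition, value, pc)
    return result


table = {"rec1": ("pp1", "v1", "pc1"), "rec2": ("pp2", "v1", "pc1")}
writes = [("rec1", "pp2", "v2", "pc2"), ("rec3", "pp2", "v2", "pc2")]

for flag in (False, True):
    print(flag, global_upsert(table, writes, flag))
```

With the flag set to `False`, `rec1` stays in `pp1` with the new values; with `True`, it ends up in `pp2` and the `pp1` copy is gone. In both runs the table holds exactly one row per record key.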
   
   
   
   
   
   





[GitHub] [hudi] nsivabalan edited a comment on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-892056474


   yeah, with global index use-case, especially when there is a clash between two records just wrt partition path, depending on the [config value](https://hudi.apache.org/docs/configurations#bloomindexupdatepartitionpathupdatepartitionpath--false) set, hudi does either of these two. 
   a. delete existing storage record in old partition and insert to new partition
   or 
   b. write the incoming record as an update to the same old partition (ignoring the new partition).
   
   In this flow hudi does not honor preCombine. 
   PreCombine will be honored when an update happens, e.g. (b) in the above scenario.
   





[GitHub] [hudi] nsivabalan closed issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3394:
URL: https://github.com/apache/hudi/issues/3394


   





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-991913445


   feel free to reopen if need be. would be happy to help.





[GitHub] [hudi] nsivabalan commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-912848915


   1. To dedup records within the same incoming batch, you need to enable these configs. 
   https://hudi.apache.org/docs/configurations#hoodiecombinebeforeupsert
   https://hudi.apache.org/docs/configurations#hoodiecombinebeforeinsert
   
   In this case, payload impl does not matter. 
   
   2. yes, you can try using DefaultHoodieRecordPayload. It is available as part of 0.8.0. 
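As a rough illustration of the settings mentioned in this thread, a PySpark-style options dict might look like the sketch below. The table name and column names (`my_table`, `key1`, `key2`, `ts`, `partition_path`) are hypothetical, the key generator choice is an assumption for a composite key, and the thread does not confirm that this combination changes the cross-partition behaviour seen in case 4.

```python
# Hypothetical Hudi write options; table and column names are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.operation": "upsert",
    # Composite record key (assumes a key generator that supports multiple fields).
    "hoodie.datasource.write.recordkey.field": "key1,key2",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.partitionpath.field": "partition_path",
    # Same column for dedup within a batch and for ordering against storage.
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.payload.ordering.field": "ts",
    # Payload that compares the ordering field when reconciling with storage.
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    # Global index settings discussed in this issue.
    "hoodie.index.type": "GLOBAL_SIMPLE",
    "hoodie.simple.index.update.partition.path": "true",
    # Dedup records within the same incoming batch.
    "hoodie.combine.before.upsert": "true",
    "hoodie.combine.before.insert": "true",
}

# Typical use with a Spark DataFrame `df` (not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Setting both the precombine and the ordering field to the same column keeps the within-batch dedup and the storage reconciliation consistent with each other.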
   





[GitHub] [hudi] vinothchandar commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-926273464


   @maddy2u any more updates on this issue ? or can we close this?





[GitHub] [hudi] maddy2u commented on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
maddy2u commented on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-914042334


   Thank you Sivabalan ! Appreciate your support. We will come back with updates shortly.





[GitHub] [hudi] maddy2u edited a comment on issue #3394: [SUPPORT] Question on hudi's default behaviour for UPSERT

Posted by GitBox <gi...@apache.org>.
maddy2u edited a comment on issue #3394:
URL: https://github.com/apache/hudi/issues/3394#issuecomment-905846390


   Hi Sivabalan,
   
   I work with Diego on this topic and I have one question regarding your response - 
   
   1. We use Hudi 0.7 on AWS Glue for processing and storing data. Based on what you mentioned, my understanding is that the statement below would not be applicable to this version of Hudi. Is it available in 0.8, or please correct my assumption? How do we enable the use of the precombine field while reconciling an incoming record? Are there any edge scenarios that we must be aware of?
   
   > But recently we also added another payload impl called DefaultHoodieRecordPayload. This payload will honor preCombine field while reconciling an incoming record with whats in storage using the preCombine field value(within combineAndGetUpdateValue()).
   
   
   Summarizing the discussion from this thread - 
   
   1. Hudi will always treat the new data coming in as the data that needs to overwrite. The data is always updated based on the new data that is coming in (implemented in OverwriteWithLatestAvroPayload)
   2. Depending on hoodie.simple.index.update.partition.path = true/false, the data will be updated in the old or new partitions.
   
   
   




