Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/11 03:24:54 UTC

[GitHub] [hudi] gubinjie opened a new issue, #6914: [SUPPORT]Unable to merge duplicate data

gubinjie opened a new issue, #6914:
URL: https://github.com/apache/hudi/issues/6914

   Hudi version: 0.10.1
   
   Here is my script for a hudi table created via Flink:
   `create table paat_hudi_flink_tyc_company
   (
       company_id                string,
       company_name              string,
       update_time               timestamp(3) comment 'Last update time',
       primary key (company_id) not enforced
   ) with (
         'connector' = 'hudi',
         'path' = 'hdfs://******/user/hudi/warehouse/paat_ods_hudi.db/',
         'hoodie.datasource.write.recordkey.field' = 'company_id', 'write.precombine.field' = 'update_time', 'write.tasks' = '1',
         'compaction.tasks' = '1', 'write.rate.limit' = '2000', 'table.type' = 'MERGE_ON_READ',
         'compaction.async.enable' = 'true', 'compaction.trigger.strategy' = 'num_or_time',
         'compaction.max_memory' = '1024', 'changelog.enable' = 'true', 'read.streaming.enable' = 'true',
         'read.streaming.check-interval' = '30', 'hive_sync.enable' = 'true', 'hive_sync.mode' = 'hms',
         'hive_sync.metastore.uris' = 'thrift://******:9083', 'hive_sync.jdbc_url' = 'jdbc:hive2://******:10000',
         'hive_sync.table' = '******', 'hive_sync.db' = '******', 'hive_sync.username' = '******',
         'hive_sync.password' = '******', 'hive_sync.support_timestamp' = 'true'
         );`
   The record key (hoodie.datasource.write.recordkey.field) is company_id, and the precombine field (write.precombine.field) is update_time.
   When the same record is written to Hudi at different times (same update_time, different _hoodie_commit_time), duplicate rows appear.
   Is this expected behavior?
   
   The following is the specific data:
   _hoodie_commit_time|_hoodie_commit_seqno       |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                   |company_id |company_name        |update_time  |
   -------------------|---------------------------|------------------|----------------------|--------------------------------------------------------------------|-----------|--------------------|-------------|
   20221010210058847  |20221010210058847_0_2734992|10000000558       |                      |5652dad0-9e32-43f5-99c4-eff0a89c6a79_0-1-5_20221010210058847.parquet|10000000558|*****|1665435515000|
   20221011094337486  |20221011094337486_0_5590349|10000000558       |                      |4f90e72d-d205-4640-975f-09ebb2ad136a_0-1-0_20221011094337486.parquet|10000000558|*****|1665435515000|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on issue #6914: [SUPPORT]Unable to merge duplicate data

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6914:
URL: https://github.com/apache/hudi/issues/6914#issuecomment-1276886504

   @gubinjie How did you write the records to the Hudi table? To merge records that share a record key, you need to use the upsert operation rather than the insert operation.
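
   As a rough illustration of the suggestion above, the table from the issue could state the write operation explicitly. This is only a sketch: it assumes `write.operation` is the Flink option that selects the operation in this Hudi version, and it trims the original option list to the entries that matter for merging. Verify the option name against the configuration reference for the release in use.

   ```sql
   -- Sketch only: a trimmed version of the table from the issue.
   -- 'write.operation' is an assumed option name; check it for your Hudi release.
   create table paat_hudi_flink_tyc_company
   (
       company_id   string,
       company_name string,
       update_time  timestamp(3) comment 'Last update time',
       primary key (company_id) not enforced
   ) with (
       'connector' = 'hudi',
       'path' = 'hdfs://******/user/hudi/warehouse/paat_ods_hudi.db/',
       'table.type' = 'MERGE_ON_READ',
       -- ask the writer to merge rows that share a record key
       'write.operation' = 'upsert',
       -- field used to choose between rows with the same record key
       'write.precombine.field' = 'update_time'
   );
   ```

   If duplicates still appear with upsert in place, the two rows in the report sit in different base files, so it may be worth checking how each batch was written before looking elsewhere.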




[GitHub] [hudi] gubinjie closed issue #6914: [SUPPORT]Unable to merge duplicate data

Posted by GitBox <gi...@apache.org>.
gubinjie closed issue #6914: [SUPPORT]Unable to merge duplicate data
URL: https://github.com/apache/hudi/issues/6914




[GitHub] [hudi] nsivabalan commented on issue #6914: [SUPPORT]Unable to merge duplicate data

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6914:
URL: https://github.com/apache/hudi/issues/6914#issuecomment-1283266883

   @gubinjie : gentle ping.

