Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/11 03:24:54 UTC
[GitHub] [hudi] gubinjie opened a new issue, #6914: [SUPPORT]Unable to merge duplicate data
gubinjie opened a new issue, #6914:
URL: https://github.com/apache/hudi/issues/6914
Hudi version: 0.10.1
Here is the DDL for a Hudi table created via Flink:
`create table paat_hudi_flink_tyc_company
(
    company_id string,
    company_name string,
    update_time timestamp(3) comment 'Last update time',
    primary key (company_id) not enforced
) with (
    'connector' = 'hudi',
    'path' = 'hdfs://******/user/hudi/warehouse/paat_ods_hudi.db/',
    'hoodie.datasource.write.recordkey.field' = 'company_id',
    'write.precombine.field' = 'update_time',
    'write.tasks' = '1',
    'compaction.tasks' = '1',
    'write.rate.limit' = '2000',
    'table.type' = 'MERGE_ON_READ',
    'compaction.async.enable' = 'true',
    'compaction.trigger.strategy' = 'num_or_time',
    'compaction.max_memory' = '1024',
    'changelog.enable' = 'true',
    'read.streaming.enable' = 'true',
    'read.streaming.check-interval' = '30',
    'hive_sync.enable' = 'true',
    'hive_sync.mode' = 'hms',
    'hive_sync.metastore.uris' = 'thrift://******:9083',
    'hive_sync.jdbc_url' = 'jdbc:hive2://******:10000',
    'hive_sync.table' = '******',
    'hive_sync.db' = '******',
    'hive_sync.username' = '******',
    'hive_sync.password' = '******',
    'hive_sync.support_timestamp' = 'true'
);`
The record key (hoodie.datasource.write.recordkey.field) is company_id, and the precombine field (write.precombine.field) is update_time.
When the same record is written to Hudi at different times (so _hoodie_commit_time differs while update_time is identical), duplicate rows appear.
Is this the expected behavior?
The following is the specific data:
_hoodie_commit_time|_hoodie_commit_seqno       |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                   |company_id |company_name        |update_time  |
-------------------|---------------------------|------------------|----------------------|--------------------------------------------------------------------|-----------|--------------------|-------------|
20221010210058847  |20221010210058847_0_2734992|10000000558       |                      |5652dad0-9e32-43f5-99c4-eff0a89c6a79_0-1-5_20221010210058847.parquet|10000000558|*****               |1665435515000|
20221011094337486  |20221011094337486_0_5590349|10000000558       |                      |4f90e72d-d205-4640-975f-09ebb2ad136a_0-1-0_20221011094337486.parquet|10000000558|*****               |1665435515000|
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] yihua commented on issue #6914: [SUPPORT]Unable to merge duplicate data
Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6914:
URL: https://github.com/apache/hudi/issues/6914#issuecomment-1276886504
@gubinjie How did you write the records to the Hudi table? To merge records that share the same record key, you need to use the upsert operation rather than the insert operation.
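To illustrate the difference between the two operations, here is a minimal Python sketch. It is a toy model only, not Hudi's implementation: the `Record` type and the `insert` and `upsert` functions are hypothetical names made up for this example, and the tie-breaking rule (an incoming record with an equal precombine value replaces the stored one) is an assumption. The data mirrors the two rows reported in the issue.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    company_id: str    # plays the role of the record key
    company_name: str
    update_time: int   # plays the role of the precombine field (epoch millis)


def insert(table: List[Record], batch: List[Record]) -> List[Record]:
    """Append-only write: no key lookup, so a repeated key yields a duplicate row."""
    return table + batch


def upsert(table: List[Record], batch: List[Record]) -> List[Record]:
    """Keyed write: keep one row per record key, preferring the larger precombine value."""
    merged: Dict[str, Record] = {r.company_id: r for r in table}
    for rec in batch:
        current = merged.get(rec.company_id)
        # Assumption of this toy model: on a tie, the incoming record wins.
        if current is None or rec.update_time >= current.update_time:
            merged[rec.company_id] = rec
    return list(merged.values())


# The same record written in two separate commits, as in the issue.
first_commit = [Record("10000000558", "*****", 1665435515000)]
second_commit = [Record("10000000558", "*****", 1665435515000)]

after_insert = insert(insert([], first_commit), second_commit)
after_upsert = upsert(upsert([], first_commit), second_commit)

print(len(after_insert), len(after_upsert))  # prints "2 1"
```

With insert semantics both copies survive, which matches the two rows shown in the issue. With upsert semantics the second write is merged into the first because both share the key 10000000558.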
[GitHub] [hudi] gubinjie closed issue #6914: [SUPPORT]Unable to merge duplicate data
Posted by GitBox <gi...@apache.org>.
gubinjie closed issue #6914: [SUPPORT]Unable to merge duplicate data
URL: https://github.com/apache/hudi/issues/6914
[GitHub] [hudi] nsivabalan commented on issue #6914: [SUPPORT]Unable to merge duplicate data
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6914:
URL: https://github.com/apache/hudi/issues/6914#issuecomment-1283266883
@gubinjie: gentle ping.