Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/24 11:25:50 UTC

[GitHub] [hudi] SabyasachiDasTR commented on issue #4311: Duplicate Records in Merge on Read [SUPPORT]

SabyasachiDasTR commented on issue #4311:
URL: https://github.com/apache/hudi/issues/4311#issuecomment-1019997016


   Hi @nsivabalan & @prashantwason ,
   
   Adding to the original issue reported by @JohnEngelhart and @sunilknataraj:
   I am from the same organization and am reporting the same issue.
   
   Regarding the possible reasons for duplicates:
   1. We are not using bulk_insert, only upsert. [PFA: upsert query]
   2. No multi-writer is involved.
   3. hoodie.combine.before.upsert is left at its default of true.
   
   Below are our recent findings.
   We could replicate the issue on a new dataset, when upserting to a new table, with the following hoodie configurations [PFA]:
   "hoodie.index.type" -> "SIMPLE",
   "hoodie.metadata.enable" -> "true"
   
   The following combinations do not produce duplicates when upserting to a new table:
   Index    metadata
   SIMPLE   FALSE
   BLOOM    FALSE
   BLOOM    TRUE
   
   
   However, when we keep using the same table that originally had duplicates and only change the
   Index & metadata configuration, these combinations still produce duplicates for the latest
   consumed data.
   Our datasets range from small to large and receive a high volume of incremental updates.
   As suggested, we updated to Index = BLOOM and metadata = false.
   We observed no duplicates for a new table with a fresh dataset, but duplicates are still being
   created on the same table for which the issue was reported.
   Inline compaction is working as expected, but we are seeing a lot of log files generated in the
   table partitions alongside the data parquet files.
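   
   For completeness, the inline compaction settings we rely on look roughly like this (a sketch;
   the delta-commit threshold below is illustrative, not necessarily what we run):
   
   // Sketch of inline compaction settings (the threshold value is illustrative).
   val compactionOptions = Map(
     "hoodie.compact.inline" -> "true",
     // compact after this many delta commits; our actual value may differ
     "hoodie.compact.inline.max.delta.commits" -> "5"
   )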
   
   Deleting the existing table and re-ingesting all of the data may be an option to evaluate, but
   it is costly for us.
   Please suggest any possible way to remove the existing duplicates and to avoid ingesting new
   duplicates into the existing table.
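   
   For reference, this is roughly how we verify whether a table currently contains duplicates
   (a sketch; it assumes spark is an active SparkSession and basePath points at the Hudi table):
   
   // Sketch: find record keys that appear more than once in the latest snapshot.
   val snapshot = spark.read.format("hudi").load(basePath)
   
   val duplicateKeys = snapshot
     .groupBy("_hoodie_record_key")
     .count()
     .filter("count > 1")
   
   println(s"record keys with duplicates: ${duplicateKeys.count()}")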
   
   [PFA]
   1. .hoodie files attached
   2. hudiOptions used.
   3. Upsert query.
   [hudiOptions.txt](https://github.com/apache/hudi/files/7925096/hudiOptions.txt)
   [upsertQuery.txt](https://github.com/apache/hudi/files/7925097/upsertQuery.txt)
   
   [hoodie_folder_SIMPLE_META_Enabled.zip](https://github.com/apache/hudi/files/7925088/hoodie_folder_SIMPLE_META_Enabled.zip)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org