Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/24 11:25:50 UTC

[GitHub] [hudi] SabyasachiDasTR commented on issue #4311: Duplicate Records in Merge on Read [SUPPORT]

SabyasachiDasTR commented on issue #4311:
URL: https://github.com/apache/hudi/issues/4311#issuecomment-1019997016


   Hi @nsivabalan & @prashantwason ,
   
   Adding to the original issue reported by @JohnEngelhart and @sunilknataraj:
   I am from the same organization and am reporting the same issue.
   
   Regarding the possible reasons for duplicates:
   1. We are not using bulk_insert, only upsert. [PFA: upsert query]
   2. No multi-writer is involved.
   3. hoodie.combine.before.upsert is left at its default of true.
   
   Below are our recent findings.
   We could replicate the issue on a new dataset, when upserting to a new table, with the following hoodie configurations [PFA]:
   "hoodie.index.type" -> "SIMPLE",
   "hoodie.metadata.enable" -> "true"
   
   The following combinations do not produce duplicates when upserting to a new table:
   Index    metadata
   SIMPLE   FALSE
   BLOOM    FALSE
   BLOOM    TRUE
   
   
   However, when we keep using the same table that originally had duplicates and only change the
   Index & metadata configuration, these combinations still produce duplicates for the latest
   consumed data.
   Our datasets range from small to large and receive a high volume of incremental updates.
   As suggested, we updated to Index = BLOOM and metadata = false.
   We observed no duplicates for a new table with a fresh dataset, but duplicates are still being
   created on the same table for which the issue was reported.
   Inline compaction is working as expected, but we are seeing a lot of log files generated in the
   table partitions alongside the data parquet files.
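   
   For completeness, the inline compaction settings we rely on look roughly like this (a sketch;
   the delta-commit threshold below is illustrative, not necessarily what we run):
   
   // Sketch of inline compaction settings (the threshold value is illustrative).
   val compactionOptions = Map(
     "hoodie.compact.inline" -> "true",
     // compact after this many delta commits; our actual value may differ
     "hoodie.compact.inline.max.delta.commits" -> "5"
   )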
   
   Deleting the existing table and re-ingesting all of the data may be an option to evaluate, but
   it is costly for us.
   Please suggest any possible way to remove the existing duplicates and to avoid ingesting new
   duplicates into the existing table.
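   
   For reference, this is roughly how we verify whether a table currently contains duplicates
   (a sketch; it assumes spark is an active SparkSession and basePath points at the Hudi table):
   
   // Sketch: find record keys that appear more than once in the latest snapshot.
   val snapshot = spark.read.format("hudi").load(basePath)
   
   val duplicateKeys = snapshot
     .groupBy("_hoodie_record_key")
     .count()
     .filter("count > 1")
   
   println(s"record keys with duplicates: ${duplicateKeys.count()}")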
   
   [PFA]
   1. .hoodie files attached
   2. hudiOptions used.
   3. Upsert query.
   [hudiOptions.txt](https://github.com/apache/hudi/files/7925096/hudiOptions.txt)
   [upsertQuery.txt](https://github.com/apache/hudi/files/7925097/upsertQuery.txt)
   
   [hoodie_folder_SIMPLE_META_Enabled.zip](https://github.com/apache/hudi/files/7925088/hoodie_folder_SIMPLE_META_Enabled.zip)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org