Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/03 13:43:53 UTC

[GitHub] [hudi] nsivabalan commented on issue #4311: Duplicate Records in Merge on Read [SUPPORT]

nsivabalan commented on issue #4311:
URL: https://github.com/apache/hudi/issues/4311#issuecomment-1004101626


   Thanks for the timeline. 
   Here are our findings so far. Unfortunately we don't have a root cause yet, but we are sharing what we have found for now. 
   
   1. We see the same data file being cleaned up by multiple clean instants, and we need to understand how that is possible. The only way I can see it happening is if a new clean is scheduled before the previous clean completes, so the new clean plan includes files that the earlier, still-pending clean will also attempt to remove. 
   2. We inspected the active timeline (we have yet to inspect the archived timeline), but could not decipher much beyond the repeated cleans reported in (1), and those alone do not explain why duplicates could occur. If you want to look yourself, see the sketch after this list. 
   3. Do you happen to know why rollbacks are kicking in? They occur only when a write fails partway through; the next writer then rolls back the failed write before starting its own. 
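   
   A minimal sketch of how you could list the clean and rollback instants yourself, assuming a placeholder base path (active instants live as files under <basePath>/.hoodie, archived ones under .hoodie/archived):
   
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      
      // Placeholder; point this at your table's base path.
      val basePath = "s3://your-bucket/path/to/table"
      val hoodieDir = new Path(basePath + "/.hoodie")
      val fs = hoodieDir.getFileSystem(new Configuration())
      
      // Each instant is a file named <timestamp>.<action>[.state];
      // repeated clean instants close together are the smell described in (1).
      fs.listStatus(hoodieDir)
        .map(_.getPath.getName)
        .filter(n => n.contains("clean") || n.contains("rollback"))
        .sorted
        .foreach(println)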
   
   Here are the ways duplicates could be introduced: 
   1. One of your writers used bulk_insert instead of upsert (bulk_insert does not de-dup against existing records). 
   2. Multiple writers were involved at some point in time; inserts from concurrent writers can result in duplicates. 
   3. https://hudi.apache.org/docs/configurations/#hoodiecombinebeforeupsert was set to false during an "UPSERT" operation, or an "INSERT" operation was used (which does not de-dup by default). In either case, duplicates in the incoming batch end up as duplicates in storage. Your third case, where the duplicates share a commit time but have different file names, most likely falls here; a hedged config example follows this list. 
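   
   A hedged example of an upsert that de-dups the incoming batch. Field and table names here are placeholders, not from your setup; hoodie.combine.before.upsert defaults to true, so the point is simply not to turn it off:
   
      // df is your incoming batch of records.
      df.write.format("hudi").
        option("hoodie.datasource.write.operation", "upsert").
        option("hoodie.datasource.write.recordkey.field", "record_key").
        option("hoodie.datasource.write.precombine.field", "ts").
        option("hoodie.combine.before.upsert", "true"). // default; do not disable
        option("hoodie.table.name", "my_table").
        mode("append").
        save(basePath)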
   
   I see you are using the SIMPLE index. Can you try the BLOOM index instead? It is very unlikely that the index lookup itself has issues, but this would rule it out. 
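   
   The index type is controlled by hoodie.index.type. A sketch of the same upsert as above with only the index type switched (placeholder names again):
   
      df.write.format("hudi").
        option("hoodie.index.type", "BLOOM"). // instead of SIMPLE
        option("hoodie.datasource.write.operation", "upsert").
        option("hoodie.datasource.write.recordkey.field", "record_key").
        option("hoodie.datasource.write.precombine.field", "ts").
        option("hoodie.table.name", "my_table").
        mode("append").
        save(basePath)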
   
   Just something to try out if feasible: read this table, upsert its contents into a new table, and see how that pans out. At least after the initial bootstrap you should not see any duplicates. If you then tee your incremental ingestion into both tables, does the new table end up with duplicates as well? A sketch follows. 
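   
   A sketch of that experiment, assuming a spark-shell session; paths and field names are placeholders, and Hudi's _hoodie_* meta columns are dropped before writing into the fresh table:
   
      // Snapshot-read the existing table.
      val source = spark.read.format("hudi").load(basePath)
      
      // Keep only the data columns for the re-upsert.
      val data = source.drop(source.columns.filter(_.startsWith("_hoodie_")): _*)
      
      // Placeholder location for the new table.
      val newBasePath = "s3://your-bucket/path/to/table_copy"
      
      data.write.format("hudi").
        option("hoodie.datasource.write.operation", "upsert").
        option("hoodie.datasource.write.recordkey.field", "record_key").
        option("hoodie.datasource.write.precombine.field", "ts").
        option("hoodie.table.name", "my_table_copy").
        mode("overwrite").
        save(newBasePath)
   
   If the copy stays clean after the bootstrap but starts accumulating duplicates once both tables receive the incremental feed, that narrows the problem to the ongoing writes.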
   
   @prashantwason: Can you help us triage this issue, please?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org