Posted to commits@hudi.apache.org by "Pratyaksh Sharma (Jira)" <ji...@apache.org> on 2020/04/15 05:33:00 UTC
[jira] [Updated] (HUDI-796) Rewrite DedupeSparkJob.scala without
considering the _hoodie_commit_time
[ https://issues.apache.org/jira/browse/HUDI-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pratyaksh Sharma updated HUDI-796:
----------------------------------
Status: Open (was: New)
> Rewrite DedupeSparkJob.scala without considering the _hoodie_commit_time
> ------------------------------------------------------------------------
>
> Key: HUDI-796
> URL: https://issues.apache.org/jira/browse/HUDI-796
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Reporter: Pratyaksh Sharma
> Assignee: Pratyaksh Sharma
> Priority: Major
>
> `_hoodie_commit_time` can only be used for deduping a partition path if the duplicates arose from an INSERT operation. In the case of updates, the bloom filter tags both files where a record is present for update, and both files will carry the same `_hoodie_commit_time` for the duplicate record thereafter.
> Hence it makes sense to rewrite this class without considering this metadata field.
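To illustrate the point above, here is a minimal sketch (not Hudi's actual code; the `Row` class, field names, and values are assumptions for illustration). After an update, both copies of a duplicated record carry the same `_hoodie_commit_time`, so the commit time cannot distinguish which copy to drop; grouping by the record key alone still dedupes correctly:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch only -- class and field names are assumptions,
// not Hudi's API. After an update, the bloom filter tags the record in
// both candidate files, and both copies share the SAME commit time, so
// commit time is useless as a tiebreaker. Deduping by record key works.
public class DedupeSketch {
    record Row(String recordKey, String commitTime, String fileId) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("k1", "20200415053300", "file-a"), // duplicate of k1,
            new Row("k1", "20200415053300", "file-b"), // same commit time
            new Row("k2", "20200415053300", "file-a")
        );

        // Group by record key only; keep one copy per key.
        Map<String, Row> deduped = rows.stream()
            .collect(Collectors.toMap(Row::recordKey, r -> r, (a, b) -> a));

        System.out.println(deduped.size()); // prints 2
    }
}
```

A real rewrite of `DedupeSparkJob.scala` would instead pick which *file's* copy to keep per record key, but the grouping principle is the same.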
--
This message was sent by Atlassian Jira
(v8.3.4#803005)