Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2022/02/23 05:32:00 UTC

[jira] [Assigned] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

     [ https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu reassigned HUDI-3355:
--------------------------------

    Assignee: tao meng

> Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3355
>                 URL: https://issues.apache.org/jira/browse/HUDI-3355
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Surya Prasanna Yalla
>            Assignee: tao meng
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Out-of-order commits can happen between two commits C1 and C2 when C2 has a greater timestamp than C1 but completes before C1.
> In our use case, we run clustering asynchronously and want ingestion writers to take precedence over clustering.
> The ingestion writer uses the following configs:
> {noformat}
> "hoodie.clustering.updates.strategy": "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": false{noformat}
> This would allow ingestion writers to ignore pending replacecommits on the timeline and continue writing.
>  
> Consider the following scenario:
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.inflight -> Started 
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit -> Completed {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started{code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight(continuing) {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (C5 conflicts with C3, but since C3 has a lower timestamp than C4, C3 is not considered during conflict resolution){code}
>  
> Here, the lastSuccessfulCommit value seen by C5 is C4, even though C3 is the commit that completed last.
> Ideally, sorting the timeline should take the transition times into account, so the timeline would look something like:
> {code:java}
> C1.commit
> C2.commit
> C4.commit(lastSuccessfulCommit seen by C5)
> C3.replacecommit 
> C5.inflight{code}
> In this case, when C5 is about to complete, it will consider all commits that completed after C4, which here is C3.
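
The scenario above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `TimelineOrderingSketch`, the type `SketchInstant`, both helper methods, and the numeric times are hypothetical and are not Hudi's timeline or conflict-resolution API. It compares selecting conflict candidates by instant timestamp (the behavior described in the report) with selecting them by completion time (the proposed ordering).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model for illustration only; not Hudi's timeline classes.
final class TimelineOrderingSketch {

    static final class SketchInstant {
        final String name;         // e.g. "C3"
        final long requestedTime;  // instant timestamp assigned when the action starts
        final Long completedTime;  // transition time to "completed"; null while inflight

        SketchInstant(String name, long requestedTime, Long completedTime) {
            this.name = name;
            this.requestedTime = requestedTime;
            this.completedTime = completedTime;
        }
    }

    // Behavior described in the report: only completed instants whose
    // *timestamp* is greater than the last successful commit are checked.
    static List<String> candidatesByTimestamp(List<SketchInstant> timeline, SketchInstant lastSuccessful) {
        List<String> out = new ArrayList<>();
        timeline.stream()
                .filter(i -> i.completedTime != null)
                .filter(i -> i.requestedTime > lastSuccessful.requestedTime)
                .sorted(Comparator.comparingLong((SketchInstant i) -> i.requestedTime))
                .forEach(i -> out.add(i.name));
        return out;
    }

    // Proposed ordering: check every instant that *completed* after the
    // last successful commit completed, regardless of its timestamp.
    static List<String> candidatesByCompletionTime(List<SketchInstant> timeline, SketchInstant lastSuccessful) {
        List<String> out = new ArrayList<>();
        timeline.stream()
                .filter(i -> i.completedTime != null)
                .filter(i -> i.completedTime > lastSuccessful.completedTime)
                .sorted(Comparator.comparingLong((SketchInstant i) -> i.completedTime))
                .forEach(i -> out.add(i.name));
        return out;
    }

    // Timeline just before C5 completes (instant5 above); times are made up.
    static List<SketchInstant> scenario() {
        List<SketchInstant> timeline = new ArrayList<>();
        timeline.add(new SketchInstant("C1", 1L, 2L));
        timeline.add(new SketchInstant("C2", 3L, 4L));
        timeline.add(new SketchInstant("C3", 5L, 12L)); // replacecommit, completes late
        timeline.add(new SketchInstant("C4", 6L, 8L));  // lastSuccessfulCommit seen by C5
        timeline.add(new SketchInstant("C5", 9L, null)); // inflight, about to commit
        return timeline;
    }

    public static void main(String[] args) {
        List<SketchInstant> timeline = scenario();
        SketchInstant c4 = timeline.get(3);
        System.out.println("By timestamp:       " + candidatesByTimestamp(timeline, c4));      // []
        System.out.println("By completion time: " + candidatesByCompletionTime(timeline, c4)); // [C3]
    }
}
```

With timestamp ordering, C3 is never compared against C5 because its timestamp is lower than C4's. With completion-time ordering, C3 shows up as the one candidate C5 must check before committing.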



--
This message was sent by Atlassian Jira
(v8.20.1#820001)