Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/10/18 13:25:00 UTC

[jira] [Comment Edited] (HUDI-2559) Ensure unique timestamps are generated for commit times with concurrent writers

    [ https://issues.apache.org/jira/browse/HUDI-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430012#comment-17430012 ] 

sivabalan narayanan edited comment on HUDI-2559 at 10/18/21, 1:24 PM:
----------------------------------------------------------------------

Here are the possible solutions:
 # Add millisecond-level granularity to the commit timestamp. [https://github.com/apache/hudi/pull/2701]
 # Add a per-writer config named writerUniqueId, which the user is expected to set to a unique string for every writer. Hudi does not depend on the actual timestamp format and generally orders commit timestamps by string comparison, so this should also work. For instance, commit timestamps currently look like this:

20211015191547

If we add a unique writer ID as a suffix, this becomes

20211015191547-writer1

So even if two writers start a new write concurrently and happen to generate the same timestamp, the commit times will be as follows:

20211015191547-writer1

20211015191547-writer2
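
The suffix idea can be sketched as follows. This is only an illustration, not Hudi code: the class WriterSuffixSketch and the helper withWriterSuffix are hypothetical names, and the "-" separator is taken from the examples above. It shows that plain string comparison keeps the two writers distinct within the same second and still orders them before a later second.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of approach 2; not Hudi's actual API.
class WriterSuffixSketch {

  // Appends the user-supplied writer ID to a commit timestamp (assumed "-" separator).
  static String withWriterSuffix(String commitTime, String writerUniqueId) {
    return commitTime + "-" + writerUniqueId;
  }

  public static void main(String[] args) {
    String sameSecond = "20211015191547";
    String nextSecond = "20211015191548";

    List<String> commitTimes = new ArrayList<>();
    commitTimes.add(withWriterSuffix(nextSecond, "writer1"));
    commitTimes.add(withWriterSuffix(sameSecond, "writer2"));
    commitTimes.add(withWriterSuffix(sameSecond, "writer1"));

    // Natural String ordering is lexicographic, the same as String.compareTo.
    Collections.sort(commitTimes);

    // Both writers from the same second come first, then the later second.
    for (String commitTime : commitTimes) {
      System.out.println(commitTime);
    }
  }
}
```

Within the same second the order is decided by the writer ID string alone, so this gives uniqueness rather than a meaningful ordering between the two concurrent writers.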

 

Approach 1:

Neat and elegant. It is very unlikely that two writers will generate the same timestamp, since the timestamps would need to match at millisecond granularity.
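
To make the difference concrete, here is a minimal sketch. The patterns "yyyyMMddHHmmss" and "yyyyMMddHHmmssSSS" are assumptions chosen to match the 14-digit timestamps shown above; the format actually used in the linked PR is not reproduced here, and the class name is hypothetical. Two instants 250 ms apart collide at second granularity but stay distinct at millisecond granularity.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration of approach 1; patterns are assumed, not taken from Hudi.
class MillisGranularitySketch {

  // Formats an epoch-millis instant with the given pattern, in UTC for determinism.
  static String format(String pattern, long epochMillis) {
    SimpleDateFormat formatter = new SimpleDateFormat(pattern);
    formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    return formatter.format(new Date(epochMillis));
  }

  public static void main(String[] args) {
    long first = 1634325347100L;   // an instant within 2021-10-15 19:15:47 UTC
    long second = first + 250;     // 250 ms later, still the same second

    String secondsPattern = "yyyyMMddHHmmss";
    String millisPattern = "yyyyMMddHHmmssSSS";

    // Second granularity: both writers would get the same commit time.
    System.out.println(format(secondsPattern, first).equals(format(secondsPattern, second)));

    // Millisecond granularity: the commit times differ.
    System.out.println(format(millisPattern, first));
    System.out.println(format(millisPattern, second));
  }
}
```

A collision is still possible if two writers land on the same millisecond, which is why this makes duplicates very unlikely rather than impossible.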

Approach 2:

This should also work. If approach 1 takes more time to develop or runs into any issues, this solution should be straightforward. We can consider releasing this as the first version and moving to approach 1 later if need be.


> Ensure unique timestamps are generated for commit times with concurrent writers
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-2559
>                 URL: https://issues.apache.org/jira/browse/HUDI-2559
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> Ensure unique timestamps are generated for commit times with concurrent writers.
> This is the piece of code in HoodieActiveTimeline that creates a new commit time.
> {code:java}
> public static String createNewInstantTime(long milliseconds) {
>   return lastInstantTime.updateAndGet((oldVal) -> {
>     String newCommitTime;
>     do {
>       newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(new Date(System.currentTimeMillis() + milliseconds));
>     } while (HoodieTimeline.compareTimestamps(newCommitTime, LESSER_THAN_OR_EQUALS, oldVal));
>     return newCommitTime;
>   });
> }
> {code}
> There is a chance that a deltastreamer and a concurrent Spark datasource writer get the same timestamp, causing one of them to fail.
> Related GitHub issues and Jira tickets:
> [https://github.com/apache/hudi/issues/3782]
> https://issues.apache.org/jira/browse/HUDI-2549


